Cancer Systems Biology - Methods and Protocols

Methods in Molecular Biology 1711
Louise von Stechow, Editor
Cancer Systems Biology
Methods and Protocols
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences,
University of Hertfordshire, Hatfield,
Hertfordshire AL10 9AB, UK
For further volumes: http://www.springer.com/series/7651
Cancer Systems Biology
Methods and Protocols
Edited by
Louise von Stechow

NNF Center for Protein Research, University of Copenhagen
Copenhagen, Denmark
ISSN 1064-3745 ISSN 1940-6029 (electronic)

Methods in Molecular Biology
ISBN 978-1-4939-7492-4 ISBN 978-1-4939-7493-1 (eBook)
https://doi.org/10.1007/978-1-4939-7493-1
Library of Congress Control Number: 2017961108
© Springer Science+Business Media LLC 2018

Chapters 5, 6 and 7 are licensed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/). For further details see license information in the chapters.
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty,
express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Printed on acid-free paper
This Humana Press imprint is published by Springer Nature

The registered company is Springer Science+Business Media, LLC
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface
Cancer is a highly complex disease that is often characterized by vast changes in the genetic
and epigenetic landscape. Those changes result in altered protein expression levels in tumors
compared to healthy tissues. Moreover, posttranscriptional alterations lead to deregulation
of signaling processes, and altered metabolic pathways can produce aberrant metabolic
signatures in cancer cells.
A wealth of high-throughput information has emerged over the last decade, including
global measurements of genes, proteins, and metabolites, as well as many other molecular
species. Those studies provide a glimpse of the molecular makeup of cancer cells on various
levels. In order to classify tumor types and predict clinical outcomes of cancer, researchers
often employ sophisticated computational tools to extract cancer-specific events from the
excessive amounts of data that have been compiled.
This volume on “Cancer Systems Biology” comprises protocols that describe systems
biology methodologies and computational tools, offering a variety of ways to analyze
different types of high-throughput cancer data. These include, for example, network- and
pathway-based analyses. Other chapters cover descriptive and predictive mathematical
models used to analyze complex cancer phenotypes and responses to anticancer drugs.
A number of chapters give an overview of data types available in large-scale data
repositories, describe state-of-the-art computational methods used, and highlight key
trends in the field of cancer systems biology.
Copenhagen, Denmark Louise von Stechow
Contents
Preface
Contributors
PART I SYSTEMS BIOLOGY OF THE CANCER GENETIC AND EPIGENETIC LANDSCAPE
1 Detection of Combinatorial Mutational Patterns in Human Cancer Genomes by Exclusivity Analysis
Hua Tan and Xiaobo Zhou
2 Discovering Altered Regulation and Signaling Through Network-based Integration of Transcriptomic, Epigenomic, and Proteomic Tumor Data
Amanda J. Kedaigle and Ernest Fraenkel
3 Analyzing DNA Methylation Patterns During Tumor Evolution
Heng Pan and Olivier Elemento
4 MicroRNA Networks in Breast Cancer Cells
Andliena Tahiri, Miriam R. Aure, and Vessela N. Kristensen
5 Identifying Genetic Dependencies in Cancer by Analyzing siRNA Screens in Tumor Cell Line Panels
James Campbell, Colm J. Ryan, and Christopher J. Lord
PART II SYSTEMS ANALYSES OF SIGNALING NETWORKS IN CANCER CELLS
6 Phosphoproteomics-Based Profiling of Kinase Activities in Cancer Cells
Jakob Wirbel, Pedro Cutillas, and Julio Saez-Rodriguez
7 Perseus: A Bioinformatics Platform for Integrative Analysis of Proteomics Data in Cancer Research
Stefka Tyanova and Juergen Cox
8 Quantitative Analysis of Tyrosine Kinase Signaling Across Differentially Embedded Human Glioblastoma Tumors
Hannah Johnson and Forest M. White
PART III SYSTEMS ANALYSIS OF CANCER CELL METABOLISM
9 Prediction of Clinical Endpoints in Breast Cancer Using NMR Metabolic Profiles
Leslie R. Euceda, Tonje H. Haukaas, Tone F. Bathen, and Guro F. Giskeødegård
PART IV SYSTEMS BIOLOGY OF METASTASIS AND TUMOR/MICROENVIRONMENT INTERACTIONS
10 Stochastic and Deterministic Models for the Metastatic Emission Process: Formalisms and Crosslinks
Christophe Gomez and Niklas Hartung
11 Mechanically Coupled Reaction-Diffusion Model to Predict Glioma Growth: Methodological Details
David A. Hormuth II, Stephanie L. Eldridge, Jared A. Weis, Michael I. Miga, and Thomas E. Yankeelov
12 Profiling Tumor Infiltrating Immune Cells with CIBERSORT
Binbin Chen, Michael S. Khodadoust, Chih Long Liu, Aaron M. Newman, and Ash A. Alizadeh
13 Systems Biology Approaches in Cancer Pathology
Aaron DeWard and Rebecca J. Critchley-Thorne
PART V MODELING DRUG RESPONSES IN CANCER CELLS
14 Bioinformatics Approaches to Predict Drug Responses from Genomic Sequencing
Neel S. Madhukar and Olivier Elemento
15 A Robust Optimization Approach to Cancer Treatment under Toxicity Uncertainty
Junfeng Zhu, Hamidreza Badri, and Kevin Leder
16 Modeling of Interactions between Cancer Stem Cells and their Microenvironment: Predicting Clinical Response
Mary E. Sehl and Max S. Wicha
17 Methods for High-throughput Drug Combination Screening and Synergy Scoring
Liye He, Evgeny Kulesskiy, Jani Saarela, Laura Turunen, Krister Wennerberg, Tero Aittokallio, and Jing Tang
Index
Contributors
TERO AITTOKALLIO Institute for Molecular Medicine Finland (FIMM), University of
Helsinki, Helsinki, Finland; Department of Mathematics and Statistics, University of
Turku, Turku, Finland
ASH A. ALIZADEH Division of Oncology, Department of Medicine, Stanford Cancer
Institute, Stanford University, Stanford, CA, USA; Division of Hematology, Department
of Medicine, Stanford Cancer Institute, Stanford University, Stanford, CA, USA; Stanford
Cancer Institute, Stanford University, Stanford, CA, USA; Institute for Stem Cell Biology
and Regenerative Medicine, Stanford University, Stanford, CA, USA
MIRIAM R. AURE Department of Cancer Genetics, Institute for Cancer Research, The
Norwegian Radium Hospital, Oslo University Hospital, Oslo, Norway
HAMIDREZA BADRI Industrial and Systems Engineering, University of Minnesota,
Minneapolis, MN, USA
TONE F. BATHEN Department of Circulation and Medical Imaging - MR Center, Faculty of
Medicine and Health Sciences, NTNU - Norwegian University of Science and Technology,
Trondheim, Norway
JAMES CAMPBELL CRUK-Centre Core Bioinformatics Facility, Department of Data
Science, The Institute of Cancer Research, London, UK
BINBIN CHEN Department of Genetics, Stanford University School of Medicine, Stanford,
CA, USA
JUERGEN COX Computational Systems Biochemistry Group, Max-Planck Institute of
Biochemistry, Martinsried, Germany
REBECCA J. CRITCHLEY-THORNE Cernostics, Inc., Pittsburgh, PA, USA
PEDRO CUTILLAS Barts Cancer Institute, Queen Mary University of London, London, UK
AARON DEWARD Cernostics, Inc., Pittsburgh, PA, USA
STEPHANIE L. ELDRIDGE Institute for Computational and Engineering Sciences, The
University of Texas at Austin, Austin, TX, USA; Biomedical Engineering, The University
of Texas at Austin, Austin, TX, USA
OLIVIER ELEMENTO Department of Physiology and Biophysics, Institute for Precision
Medicine, Institute for Computational Biomedicine, Weill Cornell Medical College,
New York, NY, USA
LESLIE R. EUCEDA Department of Circulation and Medical Imaging - MR Center, Faculty
of Medicine and Health Sciences, NTNU - Norwegian University of Science and
Technology, Trondheim, Norway
ERNEST FRAENKEL Computational and Systems Biology, Massachusetts Institute of
Technology, Cambridge, MA, USA; Department of Biological Engineering, Massachusetts
Institute of Technology, Cambridge, MA, USA
GURO F. GISKEØDEGÅRD Department of Circulation and Medical Imaging - MR Center,
Faculty of Medicine and Health Sciences, NTNU - Norwegian University of Science and
Technology, Trondheim, Norway; St. Olavs Hospital, Trondheim University Hospital,
Trondheim, Norway
CHRISTOPHE GOMEZ Aix Marseille Université, CNRS, Centrale Marseille, Marseille,
France
NIKLAS HARTUNG Department of Clinical Pharmacy and Biochemistry, Freie Universität
Berlin, Berlin, Germany; Institute of Mathematics, Universität Potsdam, Potsdam,
Germany
TONJE H. HAUKAAS Department of Circulation and Medical Imaging - MR Center,
Faculty of Medicine and Health Sciences, NTNU - Norwegian University of Science and
Technology, Trondheim, Norway
LIYE HE Institute for Molecular Medicine Finland (FIMM), University of Helsinki,
Helsinki, Finland
DAVID A. HORMUTH Institute for Computational and Engineering Sciences, The University
of Texas at Austin, Austin, TX, USA
HANNAH JOHNSON Department of Biological Engineering, Massachusetts Institute of
Technology, Cambridge, MA, USA; Signalling Laboratory, The Babraham Institute,
Cambridge, UK
AMANDA J. KEDAIGLE Computational and Systems Biology, Massachusetts Institute of
Technology, Cambridge, MA, USA
MICHAEL S. KHODADOUST Division of Oncology, Department of Medicine, Stanford Cancer
Institute, Stanford University, Stanford, CA, USA; Division of Hematology, Department
of Medicine, Stanford Cancer Institute, Stanford University, Stanford, CA, USA; Stanford
Cancer Institute, Stanford University, Stanford, CA, USA
VESSELA N. KRISTENSEN Department of Clinical Molecular Biology (EpiGen), Division
of Medicine, Akershus University Hospital, Lørenskog, Norway; Department of Cancer
Genetics, Institute for Cancer Research, The Norwegian Radium Hospital, Oslo University
Hospital, Oslo, Norway
EVGENY KULESSKIY Institute for Molecular Medicine Finland (FIMM), University of
Helsinki, Helsinki, Finland
KEVIN LEDER Industrial and Systems Engineering, University of Minnesota, Minneapolis,
MN, USA
CHIH LONG LIU Division of Oncology, Department of Medicine, Stanford Cancer Institute,
Stanford University, Stanford, CA, USA
CHRISTOPHER J. LORD The Breast Cancer Now Toby Robins Breast Cancer Research Centre
and CRUK Gene Function Laboratory, The Institute of Cancer Research, London, UK
NEEL S. MADHUKAR Department of Physiology and Biophysics, Institute for Precision
Medicine, Institute for Computational Biomedicine, Weill Cornell Medical College,
New York, NY, USA
MICHAEL I. MIGA Biomedical Engineering, Vanderbilt University, Nashville, TN, USA;
Department of Radiology, Vanderbilt University, Nashville, TN, USA; Department of
Radiological Sciences, Vanderbilt University, Nashville, TN, USA; Diagnostic Medicine,
The University of Texas at Austin, Austin, TX, USA
AARON M. NEWMAN Division of Oncology, Department of Medicine, Stanford Cancer
Institute, Stanford University, Stanford, CA, USA; Institute for Stem Cell Biology and
Regenerative Medicine, Stanford University, Stanford, CA, USA
HENG PAN Department of Physiology and Biophysics, Institute for Precision Medicine,
Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY,
USA
COLM J. RYAN Systems Biology Ireland, University College Dublin, Dublin 4, Ireland
JANI SAARELA Institute for Molecular Medicine Finland (FIMM), University of Helsinki,
Helsinki, Finland
JULIO SAEZ-RODRIGUEZ Joint Research Center for Computational Biomedicine
(JRC-COMBINE), Faculty of Medicine, RWTH Aachen University, Aachen, Germany;
European Molecular Biology Laboratory - European Bioinformatics Institute
(EMBL-EBI), Cambridge, UK
MARY E. SEHL Division of Hematology-Oncology, Department of Medicine, David Geffen
School of Medicine, University of California, Los Angeles, CA, USA; Department of
Biomathematics, David Geffen School of Medicine, University of California, Los Angeles,
CA, USA
ANDLIENA TAHIRI Department of Clinical Molecular Biology (EpiGen), Division of
Medicine, Akershus University Hospital, Lørenskog, Norway
HUA TAN Department of Radiology, Wake Forest School of Medicine, Center for
Bioinformatics & Systems Biology, Winston-Salem, NC, USA
JING TANG Institute for Molecular Medicine Finland (FIMM), University of Helsinki,
Helsinki, Finland; Department of Mathematics and Statistics, University of Turku, Turku,
Finland; Institute of Biomedicine, University of Helsinki, Helsinki, Finland
LAURA TURUNEN Institute for Molecular Medicine Finland (FIMM), University of
Helsinki, Helsinki, Finland
STEFKA TYANOVA Computational Systems Biochemistry Group, Max-Planck Institute of
Biochemistry, Martinsried, Germany
JARED A. WEIS Department of Biomedical Engineering, Wake Forest School of Medicine,
Winston-Salem, NC, USA; Comprehensive Cancer Center, Wake Forest Baptist Medical
Center, Winston-Salem, NC, USA
KRISTER WENNERBERG Institute for Molecular Medicine Finland (FIMM), University of
Helsinki, Helsinki, Finland
FOREST M. WHITE Department of Biological Engineering, Massachusetts Institute of
Technology, Cambridge, MA, USA; Koch Institute for Integrative Cancer Research,
Massachusetts Institute of Technology, Cambridge, MA, USA
MAX S. WICHA Department of Internal Medicine, University of Michigan, Ann Arbor, MI,
USA
JAKOB WIRBEL Joint Research Center for Computational Biomedicine (JRC-COMBINE),
Faculty of Medicine, RWTH Aachen University, Aachen, Germany; Institute for Pharmacy
and Molecular Biotechnology (IPMB), University of Heidelberg, Heidelberg, Germany
THOMAS E. YANKEELOV Institute for Computational and Engineering Sciences, The
University of Texas at Austin, Austin, TX, USA; Biomedical Engineering, The University
of Texas at Austin, Austin, TX, USA; Diagnostic Medicine, The University of Texas at
Austin, Austin, TX, USA; Livestrong Cancer Institutes, The University of Texas at Austin,
Austin, TX, USA
XIAOBO ZHOU Department of Radiology, Wake Forest School of Medicine, Center for
Bioinformatics & Systems Biology, Winston-Salem, NC, USA
JUNFENG ZHU Industrial and Systems Engineering, University of Minnesota, Minneapolis,
MN, USA
Part I
Systems Biology of the Cancer Genetic and Epigenetic Landscape
Chapter 1
Detection of Combinatorial Mutational Patterns in Human Cancer Genomes by Exclusivity Analysis
Hua Tan and Xiaobo Zhou
Abstract
Within a specific cancer, genes tend to mutate either together (co-mutation) or in a mutually exclusive
manner across tumor samples; these are the two known combinatorial mutational patterns for a given gene set.
Previous studies have established that genes functioning in different signaling pathways can mutate in the
same sample, i.e., a tumor from one patient, while genes operating in the same pathway are rarely mutated
in the same cancer genome. Therefore, reliable identification of combinatorial mutational patterns of
candidate cancer genes has important ramifications for inferring signaling network modules in a particular
cancer type. While algorithms for discovering mutated driver pathways based on mutual exclusivity of
mutations in cancer genes have been proposed, a systematic pipeline for identifying both co-mutational and
mutually exclusive patterns with rational significance estimation is still lacking. Here, we describe a reliable
framework with detailed procedures to explore both combinatorial mutational patterns simultaneously
in public cross-sectional gene mutation data.
Key words Cancer genomics, Co-mutation, Mutual exclusivity, Signaling pathway, Hypergeometric test
1 Introduction
Genetic aberrations and deleterious environmental exposure act in concert to govern the
development of various human diseases, including cancer [1–7]. In particular, somatic driver
mutations accumulating in the human genome are widely recognized as the culprit of human
cancer initiation and progression [1, 2, 4]. While numerous somatic mutations can be detected
in a single tumor, the mutations are distributed across the genome in a cancer-specific and
sample-specific manner [4, 5, 8, 9]. Cancer specificity means that the mutational pattern varies
between cancer tissue types, e.g., liver, lung, and breast cancers, whereas sample specificity
refers to the mutational variety between patient samples of the same cancer type. For a
specified cancer type, some genes are altered commonly across patient samples, while others
exhibit apparent sample specificity [5]. Previous experimental and statistical analyses have
consistently revealed two combinatorial mutational patterns for a given set of genes, termed
co-mutational and mutually exclusive patterns [5, 8–10]. As shown in Fig. 1, the co-mutational
pattern occurs when a set of genes tends to mutate simultaneously in a single tumor, while the
mutually exclusive pattern refers to the scenario in which one and only one of a set of genes is
likely to be altered in a tumor.
Fig. 1 Schematic representation of two combinatorial mutational patterns studied in this protocol: the
co-mutational pattern (upper panel) refers to the scenario that a set of genes tends to mutate simultaneously
in a tumor sample, whereas the mutually exclusive pattern (lower panel) represents the opposite scenario:
genes in a given set tend to avoid mutating simultaneously in any one tumor sample. Each panel shows
Gene 1 and Gene 2 across tumor samples, with entries marked as mutant or wild type
Louise von Stechow (ed.), Cancer Systems Biology: Methods and Protocols, Methods in Molecular Biology, vol. 1711,
https://doi.org/10.1007/978-1-4939-7493-1_1, © Springer Science+Business Media, LLC 2018
Mutually exclusive genes are likely to function in the same
signaling pathway, whereas co-mutational genes are likely to act
in different pathways [11]. Combinatorial patterns of genes
can therefore be leveraged to infer signaling networks implicated in human
cancer development and progression. Indeed, considerable effort has
been devoted to the de novo discovery of driver pathways based on
mutual exclusivity of gene mutations [11–13]. Identifying gene pairs
or gene sets with significant combinatorial mutational patterns is
thus of considerable biological relevance.
Previous work proposed a statistical method to address this
question and nominated a number of gene sets with significant
combinatorial patterns [10]. However, that analysis was performed
on a very limited set of cell line data and therefore lacks a
detailed procedure for preprocessing data from a large mutation
database comprising many clinical samples of
various cancer types (e.g., the recently released Catalog of Somatic
Mutations in Cancer, COSMIC [14], and The Cancer Genome
Atlas, TCGA, https://tcga-data.nci.nih.gov/tcga/). In addition,
the analysis by Yeang et al. adopted different hypothesis tests to
estimate the significance levels of the two combinatorial mutational
patterns, which tends to yield an overly conservative p-value for the
co-mutational pattern [10].
To address these issues, we here describe a systematic and reliable
pipeline to identify both combinatorial mutational patterns in cancer
genomes. Here, somatic mutations exclude synonymous point
mutations, which do not change an amino acid (marked as “coding
silent” in COSMIC). Those mutations typically have little, if any,
impact on the biological function of the corresponding proteins and are
uninformative for signaling pathway inference [11, 15]. Other types
of mutations, such as missense and nonsense point mutations, small
insertions and deletions, frame shifts, gene fusions, and translocations,
can be counted as effective mutations when performing
exclusivity analysis. Furthermore, the mutations should be detected
by genome-wide or exome-wide screening that covers all
protein-coding genes, to minimize the statistical
bias induced by incomplete sample coverage.
The step-by-step procedure for data acquisition and the criteria for
data quality control, as well as the specific formulae used to calculate
the likelihood ratio (LR) and significance level (p-value), are
described in the following sections. Figure 2 illustrates the overall
procedure of this pipeline for mutational pattern determination.
Fig. 2 Schematic of the overall pipeline proposed in this protocol. Flowchart steps, in order: mutation
records in database; quality control and preprocessing; calculation of likelihood ratio and significance
level; determination of mutational pattern; inference of signaling network. The specific
steps of text processing, computation, and visualization are provided in
Subheading 3
This pipeline has been shown to be effective and efficient in
identifying mutational patterns of gene pairs in cancer mutation
data from COSMIC v68, as described in our earlier publication
[5]. We recently applied it to the latest COSMIC release
(version 76), which has expanded threefold since v68; the analysis
recapitulated and significantly improved the previous results
(data not shown). However, our previous efforts were mainly devoted
to the biological discoveries rather than the technical details of the
analysis. In this protocol, we address this gap by providing extensive
practical details and highlighting alternative solutions for problems
that users may encounter in their particular applications.
2 Materials
The pipeline proposed in this protocol has been successfully tested
on the Catalog of Somatic Mutations in Cancer (COSMIC, releases
v68 and v76). Therefore, the procedures described in the following
section are based mainly on the COSMIC database. However, the
protocol is applicable to other databases, such as TCGA, that contain
both gene mutations and the associated patient sample IDs.
Synonymous mutations and mutations
that are not from a genome-wide or exome-wide study should
be excluded prior to further analysis. All text processing,
subsequent computation, and visualization can be implemented in
Matlab (The MathWorks, Inc.), as described previously [5], or in R
[16], another popular language for data analysis and graphics.
3 Methods
3.1 Data Quality Control and Preprocessing of COSMIC Mutation Entries
1. Extract mutations of a designated cancer type from the mixed
mutation records in COSMIC by the keyword “Primary site”
(see Note 1).
2. Remove synonymous mutations by the keyword “Substitution-
coding silent” (see Note 1).
3. Remove mutation records that are not from a genome-wide
study by the keyword “genome-wide screen” (see Note 1).
4. Generate a gene mutation pattern matrix based on the mutations
and sample IDs. The rows and columns of the matrix refer
to samples and genes, respectively. The entry at row i and
column j of the matrix is the number of mutations
occurring in gene j in tumor sample i. Figure 3 highlights an
example in which the 9th tumor sample has a mutation in gene
2, marked by the coordinate (9,2) (see Note 2).
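The four preprocessing steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' Matlab or R implementation: the record field names (`sample_id`, `gene`, `primary_site`, `mutation_description`, `genome_wide_screen`), the `"y"` flag value, and the exact spelling of the silent-mutation label are assumptions and should be adapted to the columns of the actual COSMIC export.

```python
from collections import defaultdict

# Hypothetical label; check the spelling used in your COSMIC export.
SILENT = "Substitution-coding silent"

def build_mutation_matrix(records, primary_site, silent_label=SILENT):
    """Filter COSMIC-like records and count mutations per (sample, gene)."""
    counts = defaultdict(int)
    for rec in records:
        if rec["primary_site"] != primary_site:           # step 1: cancer type
            continue
        if rec["mutation_description"] == silent_label:   # step 2: drop synonymous
            continue
        if rec["genome_wide_screen"] != "y":              # step 3: genome-wide only
            continue
        counts[(rec["sample_id"], rec["gene"])] += 1      # step 4: count mutations
    samples = sorted({s for s, _ in counts})
    genes = sorted({g for _, g in counts})
    # Rows are samples, columns are genes, entries are mutation counts.
    matrix = [[counts.get((s, g), 0) for g in genes] for s in samples]
    return samples, genes, matrix

def rec(sample_id, gene, site, description, screen):
    """Helper that builds one toy record (hypothetical schema)."""
    return {"sample_id": sample_id, "gene": gene, "primary_site": site,
            "mutation_description": description, "genome_wide_screen": screen}

records = [
    rec("S1", "TP53", "lung", "Substitution - Missense", "y"),
    rec("S1", "TP53", "lung", "Substitution - Nonsense", "y"),
    rec("S1", "KRAS", "lung", SILENT, "y"),                      # removed in step 2
    rec("S2", "KRAS", "lung", "Substitution - Missense", "y"),
    rec("S3", "KRAS", "liver", "Substitution - Missense", "y"),  # removed in step 1
    rec("S2", "EGFR", "lung", "Substitution - Missense", "n"),   # removed in step 3
]
samples, genes, matrix = build_mutation_matrix(records, "lung")
```

In this toy input, sample S1 carries two TP53 mutations, so the corresponding matrix entry is 2 rather than 1; the later steps only test whether an entry is nonzero.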
Fig. 3 Schematic depicting the mutation pattern matrix and entry filtering criteria. Each panel shows samples
(rows, numbered 1–12) against genes (columns, numbered 1–10), with entries marked as mutant or wild type and
the coordinate (9,2) highlighted. (a) A mutation pattern matrix
is generated to represent the mutation profiles of the tumor samples across all genes. A gray grid indicates the
corresponding sample has at least one mutation on the gene specified by the column ID. (b) Columns 3, 5, and
6 are deleted since the associated genes are mutated in only a small fraction of samples (the threshold of
fraction can be prescribed). (c) Row 11 is deleted as the corresponding sample has no mutation in the
remaining genes after the processing in (b)
3.2 Calculation of Likelihood of Co-occurrence of Mutant Genes
1. Exclude genes (columns) that do not exceed a prescribed
threshold of sample coverage. As shown in Fig. 3a, b, if the
percentage of nonzero entries in a column is lower than the
threshold, delete this column (see Note 3).
2. Remove samples (rows) that do not harbor any mutations
across the remaining genes, as shown in Fig. 3b, c.
3. Calculate the likelihood ratio LRcomb of co-occurrence for each
gene pair by formula (1):
\[ \mathrm{LR}_{\mathrm{comb}} = \frac{P(g_1 = 1,\ g_2 = 1)}{P(g_1 = 1)\,P(g_2 = 1)} \tag{1} \]
where P(g1 = 1) and P(g1 = 1, g2 = 1) correspond to the
percentage of samples in which a single gene or both genes
are mutated, respectively.
4. Determine the thresholds of the likelihood ratio (LRcomb) for
pattern categorization by fitting a Gaussian mixture model
using an Expectation-Maximization algorithm [17].
Specifically, suppose m1 and m2 are the means of the
low and high components of all LRcomb values, and δ1 and δ2 their
respective standard deviations. Then the thresholds for the
co-mutational pattern (lower bound) and the exclusive pattern
(upper bound) are calculated as θ1 = m2 − δ2/2 and
θ2 = m1 + δ1/2, respectively (see Note 4).
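A compact Python sketch of steps 1–4 is given below under stated assumptions. The function names are hypothetical, the matrix is a plain list of rows (samples) by columns (genes), and the EM routine is a simple hand-written two-component fit with a deterministic initialization and a small floor on the standard deviation; both choices are assumptions of this sketch, not details taken from the protocol or from reference [17]. It needs at least two LRcomb values and a coverage threshold greater than zero.

```python
import math
from itertools import combinations

def filter_matrix(matrix, genes, min_coverage):
    """Steps 1-2: drop low-coverage genes, then samples left without mutations."""
    n = len(matrix)
    keep = [j for j in range(len(genes))
            if sum(1 for row in matrix if row[j] > 0) / n >= min_coverage]
    rows = [[row[j] for j in keep] for row in matrix]
    rows = [row for row in rows if any(v > 0 for v in row)]
    return rows, [genes[j] for j in keep]

def likelihood_ratios(matrix, genes):
    """Step 3: LR_comb = P(g1=1, g2=1) / (P(g1=1) * P(g2=1)) for each gene pair."""
    n = len(matrix)
    frac = [sum(1 for row in matrix if row[j] > 0) / n for j in range(len(genes))]
    lr = {}
    for a, b in combinations(range(len(genes)), 2):
        both = sum(1 for row in matrix if row[a] > 0 and row[b] > 0) / n
        lr[(genes[a], genes[b])] = both / (frac[a] * frac[b])
    return lr

def fit_two_gaussians(values, n_iter=200):
    """Step 4: EM fit of a 1-D two-component Gaussian mixture (deterministic start)."""
    xs = sorted(values)
    half = len(xs) // 2
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    sd = [max((xs[-1] - xs[0]) / 4, 1e-3)] * 2
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of the first component for each value
        resp = []
        for x in xs:
            d = [w[k] / sd[k] * math.exp(-0.5 * ((x - mu[k]) / sd[k]) ** 2) for k in (0, 1)]
            resp.append(d[0] / (d[0] + d[1]) if d[0] + d[1] > 0 else 0.5)
        # M-step: update weight, mean, and standard deviation of each component
        for k in (0, 1):
            r = resp if k == 0 else [1 - p for p in resp]
            nk = sum(r)
            if nk < 1e-12:
                continue
            w[k] = nk / len(xs)
            mu[k] = sum(ri * x for ri, x in zip(r, xs)) / nk
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / nk
            sd[k] = max(math.sqrt(var), 1e-3)
    return sorted(zip(mu, sd))  # low component first

def pattern_thresholds(lr_values):
    """theta1 = m2 - d2/2 (co-mutational lower bound), theta2 = m1 + d1/2 (exclusive upper bound)."""
    (m1, d1), (m2, d2) = fit_two_gaussians(lr_values)
    return m2 - d2 / 2, m1 + d1 / 2

genes = ["A", "B", "C", "D"]
matrix = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
rows, kept = filter_matrix(matrix, genes, min_coverage=0.3)
lr = likelihood_ratios(rows, kept)
theta1, theta2 = pattern_thresholds([0.1, 0.15, 0.2, 1.8, 1.9, 2.0])
```

In the toy matrix, gene D is mutated in only one of five samples and is dropped at a 30% threshold, which in turn removes the fifth sample. Genes A and B always co-occur (LRcomb of 2), whereas A and C never do (LRcomb of 0).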
3.3 Calculation of Significance of Combinatorial Mutational Patterns
1. Calculate the significance level of the co-mutational pattern, Pco,
by the hypergeometric test as in formula (2):
\[ P_{\mathrm{co}} = \sum_{k=n_{12}}^{n_2} \binom{n_1}{k}\binom{n-n_1}{n_2-k} \Big/ \binom{n}{n_2} \tag{2} \]
where n, n1, n2, and n12 represent the numbers of total samples,
samples harboring a gene 1 mutation, samples harboring a gene
2 mutation, and samples harboring both gene 1 and gene
2 mutations, respectively.
2. Calculate the significance level of the mutually exclusive pattern,
Pexcl, by formula (3):
\[ P_{\mathrm{excl}} = \sum_{k=0}^{n_{12}} \binom{n_1}{k}\binom{n-n_1}{n_2-k} \Big/ \binom{n}{n_2} \tag{3} \]
where n, n1, n2, and n12 are defined as in the formula for Pco above.
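Formulas (2) and (3) are the upper and lower tails of a hypergeometric distribution and can be evaluated exactly with integer binomial coefficients. The sketch below uses only the Python standard library; the function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(k, n, n1, n2):
    """Probability that exactly k of the n2 gene-2 mutant samples also carry a gene-1 mutation."""
    return comb(n1, k) * comb(n - n1, n2 - k) / comb(n, n2)

def p_co(n, n1, n2, n12):
    """Formula (2): probability of observing n12 or more co-mutated samples by chance."""
    return sum(hypergeom_pmf(k, n, n1, n2) for k in range(n12, n2 + 1))

def p_excl(n, n1, n2, n12):
    """Formula (3): probability of observing n12 or fewer co-mutated samples by chance."""
    return sum(hypergeom_pmf(k, n, n1, n2) for k in range(0, n12 + 1))

# Toy example: 10 samples, each gene mutated in 4 of them.
p_all_shared = p_co(10, 4, 4, 4)      # all four mutant samples coincide
p_none_shared = p_excl(10, 4, 4, 0)   # no sample carries both mutations
```

Terms with k larger than n1 contribute zero because `comb(n1, k)` is 0 in that case, so summing up to n2 is safe. Both sums include the term k = n12, so Pco and Pexcl for the same gene pair add up to slightly more than 1.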
4 Interpretation of the Results
Both co-mutational and mutually exclusive patterns of gene pairs
carry biological meaning for signaling network inference.
For the co-mutational pattern, genes are likely to function in
different signaling pathways and exert a synergistic impact on
tumor progression. Therefore, multiple oncogenic pathways
driving tumorigenesis in a particular cancer type could be
identified by analyzing these co-mutational patterns. For the exclusive
pattern, more insights can be obtained. In particular, the affiliation
of genes to a signaling pathway can be inferred from a list of
gene pairs with the exclusive pattern. For example, if A-B, B-C, and
C-A are all exclusive gene pairs, then it is reasonable to conclude
that genes A, B, and C are likely to operate in the same signaling
pathway for the cancer type in question. When the whole
list of gene pairs is visualized in one graph, as shown at the bottom
of Fig. 2 (drawn with Cytoscape [18]), a signaling network can emerge.
However, for real applications this preliminary signaling network is
subject to modification based on prior knowledge about
gene-gene/protein-protein interactions and experimental evidence (see
Note 5). To conclude, the exclusive patterns can be used to infer
a specific cancer-associated signaling pathway, while the
co-mutational patterns can help explore whether multiple
oncogenic pathways are involved, as described in our previous
work [5] and references therein.
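The A-B, B-C, C-A reasoning can be automated by searching the list of exclusive gene pairs for triangles, i.e., triples in which every pair is exclusive. The following sketch is one simple way to do this; it is an illustration of the reasoning, not the Cytoscape-based visualization used by the authors, and the function name is hypothetical.

```python
def exclusive_triples(pairs):
    """Return sorted gene triples in which all three gene pairs are mutually exclusive."""
    neighbours = {}
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    triples = set()
    for a, b in pairs:
        # Any gene exclusive with both a and b closes a triangle.
        for c in neighbours[a] & neighbours[b]:
            triples.add(tuple(sorted((a, b, c))))
    return sorted(triples)

exclusive_pairs = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
modules = exclusive_triples(exclusive_pairs)
```

Here gene D is exclusive only with C, so it is not assigned to the candidate pathway formed by A, B, and C. Such candidates remain hypotheses to be checked against known interactions.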
5 Notes
1. If combinatorial mutational pattern analysis needs to be conducted
on tumor subtypes instead of tissue types, the keyword
“Site subtype” can be used to further divide mutations into
smaller groups of subtypes. For TCGA data, since genomic and
epigenomic data are deposited separately for different cancer
types, the mutation data for a cancer type of interest can be
downloaded directly from the TCGA web site by choosing the level
2 MAF (Mutation Annotation Format) file. Also, because the
amino acid change corresponding to a nucleotide alteration
is not available in TCGA, step 2 (Subheading 3.1, which
aims at removing the coding silent mutations) can be skipped.
Since all the mutation data in TCGA are based on exome-wide
sequencing, it is not necessary to implement the
screening procedure specified in step 3 (Subheading 3.1).
2. When generating the gene mutation pattern matrix, the mutations
counted for each gene could be restricted to particular
mutation types such as missense point mutations or gene
fusions, depending on the biological question to be answered
and the working model hypothesis to be tested.
3. The threshold of sample coverage used to exclude the less
frequently mutated genes can be adjusted according to the
sample size and/or gene set size, to yield a reasonable number
of combinations of genes. In our previous work, the threshold
of sample coverage was set to 2–10%.
4. When determining the thresholds for mutational pattern
categorization, the Gaussian mixture model fitted with the
Expectation-Maximization (EM) algorithm can sometimes produce
inconsistent outcomes over technical replicates. This is largely
due to the stochastic nature of the EM optimization procedure.
A reliable alternative is to simply use LRcomb = 1 to divide
candidate gene pairs into two groups, with LRcomb < 1 indicating
the exclusive pattern and the remainder the co-mutational pattern.
Then rank the pairs in each group according to their p-values and
select a reasonable number (e.g., 20–30) of the top-ranked significant
gene pairs in each group.
5. Although the pipeline is applied to gene pairs, sets of genes
with particular combinatorial patterns could emerge by inte-
grating gene pairs of corresponding patterns. Thus, the pipe-
line introduced in this protocol can serve as a starting point for
inferring signaling network modules in particular cancers.
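The fallback rule in Note 4 can be sketched in a few lines of Python. This is only an illustration, assuming each candidate gene pair has already been summarized as a tuple of (gene A, gene B, LRcomb, P value); the function name `split_and_rank`, the tuple layout, and the example values are hypothetical and are not part of the original pipeline.

```python
def split_and_rank(pairs, top_n=20):
    """Split gene pairs at LRcomb = 1 and rank each group by P value.

    pairs: iterable of (gene_a, gene_b, lr_comb, p_value) tuples
           (an assumed layout, chosen for this sketch).
    Returns (exclusive, co_mutational); each list is sorted by ascending
    P value and truncated to the top_n most significant pairs.
    """
    pairs = list(pairs)
    exclusive = [p for p in pairs if p[2] < 1]        # LRcomb < 1: exclusive pattern
    co_mutational = [p for p in pairs if p[2] >= 1]   # remainder: co-mutational pattern

    def p_value(pair):
        return pair[3]

    return (sorted(exclusive, key=p_value)[:top_n],
            sorted(co_mutational, key=p_value)[:top_n])


# Made-up values, for illustration only.
example_pairs = [
    ("GENE1", "GENE2", 0.2, 1e-6),
    ("GENE3", "GENE4", 3.5, 1e-4),
    ("GENE1", "GENE5", 0.7, 1e-9),
    ("GENE2", "GENE6", 1.8, 1e-3),
]
exclusive, co_mutational = split_and_rank(example_pairs, top_n=20)
```

Because the split uses a fixed cutoff rather than a fitted mixture model, repeated runs on the same input give the same two groups.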
Acknowledgments
This work was partially supported by the Beijing Normal University
youth funding (105502GK and 2013YB43 to H.T.) and National
Institutes of Health (1U01CA166886, 1R01LM010185, and
1U01HL111560 to X.Z.).
References
1. Hanahan D, Weinberg RA (2000) The hallmarks of cancer. Cell
100(1):57–70. https://doi.org/10.1016/S0092-8674(00)81683-9
2. Stratton MR, Campbell PJ, Futreal PA (2009) The cancer genome.
Nature 458(7239):719–724. https://doi.org/10.1038/Nature07943
3. Peng H, Tan H, Zhao W, Jin G, Sharma S, Xing F, Watabe K, Zhou X
(2016) Computational systems biology in cancer brain metastasis.
Front Biosci 8:169–186
4. Tan H, Bao J, Zhou X (2012) A novel missense-mutation-related
feature extraction scheme for ‘driver’ mutation identification.
Bioinformatics 28(22):2948–2955.
https://doi.org/10.1093/bioinformatics/bts558
5. Tan H, Bao J, Zhou X (2015) Genome-wide mutational spectra
analysis reveals significant cancer-specific heterogeneity. Sci Rep
5:12566. https://doi.org/10.1038/srep12566
6. Tan H, Li F, Singh J, Xia X, Cridebring D, Yang J, Bao J, Ma J,
Zhan M, Wong STC (2012) A 3-dimentional multiscale model to simulate
tumor progression in response to interactions between cancer stem
cells and tumor microenvironmental factors. IEEE 6th International
Conference on Systems Biology (ISB):297–303.
https://doi.org/10.1109/ISB.2012.6314153
7. Tan H, Wei K, Bao J, Zhou X (2013) In silico study on multidrug
resistance conferred by I223R/H275Y double mutant neuraminidase.
Mol BioSyst 9(11):2764–2774. https://doi.org/10.1039/c3mb70253g
8. Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA Jr,
Kinzler KW (2013) Cancer genome landscapes. Science
339(6127):1546–1558. https://doi.org/10.1126/science.1235122
9. Vogelstein B, Kinzler KW (2004) Cancer genes and the pathways
they control. Nat Med 10(8):789–799. https://doi.org/10.1038/nm1087
10. Yeang CH, McCormick F, Levine A (2008) Combinatorial patterns of
somatic gene mutations in cancer. FASEB J 22(8):2605–2622.
https://doi.org/10.1096/fj.08-108985
11. Ciriello G, Cerami E, Sander C, Schultz N (2012) Mutual
exclusivity analysis identifies oncogenic network modules. Genome
Res 22(2):398–406. https://doi.org/10.1101/gr.125567.111
12. Vandin F, Upfal E, Raphael BJ (2012) De novo discovery of
mutated driver pathways in cancer. Genome Res 22(2):375–385.
https://doi.org/10.1101/gr.120477.111
13. Leiserson MD, Blokh D, Sharan R, Raphael BJ (2013) Simultaneous
identification of multiple driver pathways in cancer. PLoS Comput
Biol 9(5):e1003054. https://doi.org/10.1371/journal.pcbi.1003054
14. Forbes SA, Beare D, Gunasekaran P, Leung K, Bindal N,
Boutselakis H, Ding M, Bamford S, Cole C, Ward S, Kok CY, Jia M,
De T, Teague JW, Stratton MR, McDermott U, Campbell PJ (2015)
COSMIC: exploring the world’s knowledge of somatic mutations in
human cancer. Nucleic Acids Res 43(Database issue):D805–D811.
https://doi.org/10.1093/nar/gku1075
15. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C,
Bignell G, Davies H, Teague J, Butler A, Stevens C, Edkins S,
O’Meara S, Vastrik I, Schmidt EE, Avis T, Barthorpe S, Bhamra G,
Buck G, Choudhury B, Clements J, Cole J, Dicks E, Forbes S, Gray K,
Halliday K, Harrison R, Hills K, Hinton J, Jenkinson A, Jones D,
Menzies A, Mironenko T, Perry J, Raine K, Richardson D, Shepherd R,
Small A, Tofts C, Varian J, Webb T, West S, Widaa S, Yates A,
Cahill DP, Louis DN, Goldstraw P, Nicholson AG, Brasseur F,
Looijenga L, Weber BL, Chiew YE, DeFazio A, Greaves MF, Green AR,
Campbell P, Birney E, Easton DF, Chenevix-Trench G, Tan MH, Khoo SK,
Teh BT, Yuen ST, Leung SY, Wooster R, Futreal PA, Stratton MR (2007)
Patterns of somatic mutation in human cancer genomes. Nature
446(7132):153–158. https://doi.org/10.1038/nature05610
16. Ihaka P, Gentleman R (1996) R: a language for data analysis and
graphics. J Comput Graph Stat 5(3):299–314
17. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from
incomplete data via EM Algorithm. J Roy Stat Soc B Met 39(1):1–38
18. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D,
Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software
environment for integrated models of biomolecular interaction
networks. Genome Res 13(11):2498–2504.
https://doi.org/10.1101/gr.1239303
Chapter 2
Discovering Altered Regulation and Signaling Through
Network-based Integration of Transcriptomic, Epigenomic,
and Proteomic Tumor Data
Amanda J. Kedaigle and Ernest Fraenkel
Abstract
With the extraordinary rise in available biological data, biologists and clinicians need unbiased tools for data
integration in order to reach accurate, succinct conclusions. Network biology provides one such method for
high-throughput data integration, but comes with its own set of algorithmic problems and needed
expertise. We provide a step-by-step guide for using Omics Integrator, a software package designed for
the integration of transcriptomic, epigenomic, and proteomic data. Omics Integrator can be found at
http://fraenkel.mit.edu/omicsintegrator.
Key words Data integration, Network biology, Computational biology, High-throughput data
1 Introduction
As biologists gain access to increasing amounts of data, the
challenges associated with interpreting those data have increased. Biol-
ogists and clinicians can obtain high-throughput information about
a cell’s genome, transcriptome, epigenome, and proteome with
reasonable effort and constantly decreasing costs. Indeed, much
of those data are freely available to scientists through resources such
as The Cancer Genome Atlas [1] and ENCODE [2]. The challenge
remains, however, in knowing how to interpret those rich datasets.
These “omic” data can be extraordinarily valuable. However, this
value can only be extracted if data are properly analyzed using
methods that account for the relatively high error rate of high-
throughput experiments [3], and then condensed into understand-
able and actionable hypotheses about the underlying biology. This
process can be especially difficult, and especially rewarding, when
attempting to integrate several kinds of high-throughput data. Our
group and others have shown that integrating data from several
sources can lead to novel discoveries that each assay could have
missed on its own [4–6].
Network biology is a fast-growing category of methods for this
type of analysis [7]. Network models provide a valuable resource for
biologists looking to analyze their high-throughput data in a sys-
tems context. By mapping “hits” from high-throughput assays
onto interaction networks, the mechanistic connections between
the hits become obvious, and investigators can focus on pathways,
or series of interactions in the cell that are related to a certain
function, that may be perturbed in the system.
Network methods typically involve modeling the molecules
within a cell—which can for example be DNA, mRNAs, proteins,
or metabolites—as nodes in a graph. Edges between these nodes
connect molecules that are functionally or physically connected
[7]. For example, a protein-protein interaction network (PPI)
would represent the binding of protein A to protein B by drawing
an edge between the “A” node and “B” node in the network.
Several publicly available databases have been created to translate
experimentally discovered protein interactions into PPIs, such as
iRefIndex [8], BioGRID [9], and STRING [10]. There are also
databases that store interactions of proteins with other molecules,
such as metabolites [11–13]. In other types of networks, the edges
can represent more abstract relationships. For example, in a
correlation-based network, edges between nodes might represent
probable co-regulation, rather than physical interactions, based on
covariance between the concentration of molecule A and molecule
B [14, 15].
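As a concrete illustration of the node-and-edge representation described above, the following sketch stores a small undirected, weighted interaction network as a dictionary of dictionaries. The function name `build_network`, the three-field edge layout, and the example weights are assumptions made for this sketch and are not taken from any of the cited databases.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected weighted network as {node: {neighbor: weight}}."""
    network = defaultdict(dict)
    for node_a, node_b, weight in edges:
        # An undirected edge is stored in both directions.
        network[node_a][node_b] = weight
        network[node_b][node_a] = weight
    return network


# Protein A binds protein B, and B binds C (weights are made up).
ppi_edges = [("A", "B", 0.9), ("B", "C", 0.4)]
network = build_network(ppi_edges)
neighbors_of_b = sorted(network["B"])
```

In a correlation-based network the same structure could be reused, with the weight holding a correlation measure instead of a confidence in a physical interaction.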
Mapping high-throughput hits onto networks in search of
affected pathways has several advantages. Hits that are close to
each other in a network might function in the same pathway.
Focusing on subnetworks of functionally related nodes can produce
a more tractable number of targets, rather than the potentially
hundreds of individual factors identified in high-throughput
experiments. In addition, this type of pathway identification
reduces the chance of devoting resources to the analysis of false
positives from the high-throughput screen. Although the confi-
dence for each hit in a screen may be low, the confidence in a
pathway that contains many hits is much higher. Finally, pathway
analysis can help to find novel nodes that may not have appeared in
a high-throughput screen. These “hidden nodes” can be false
negatives in a screen, or true negatives that are nonetheless impor-
tant players in the investigated biological system. Our work has
shown that these hidden nodes can often be important to a system
under study, despite the lack of direct experimental evidence [4, 16,
17]. Using the PPI to discover these pathways de novo, rather than
relying on predetermined pathway databases like KEGG [18],
expands our ability to find novel information, and avoids biasing
the results toward well-studied pathways.
However, network analysis is not as simple as just mapping
high-throughput assay hits onto PPIs and finding all possible con-
nections through them. Because of the large and highly connected
nature of most biological networks [7], this “brute force” method
results in extremely dense, uninterpretable “hairballs” rather than
clear pathways [16]. Moreover, combining several types of experi-
mental assays into a unified analysis can be complex. For example,
experiments assessing changes in mRNA levels and protein levels
are often not well correlated [19]. It is not trivial to map them onto
one protein or RNA interaction network together. This chapter will
walk you through the use of Omics Integrator, a software package
that proposes a solution to these problems [17].
Omics Integrator is a new software tool designed to help
biologists analyze and synthesize several kinds of high-throughput
omics data, and reduce it to a few important, high-confidence
pathways. Omics Integrator is designed for ease of use by biologists
with basic computer skills (comfort with using the Unix command
line is helpful). Omics Integrator first uses transcriptomic and
epigenomic data to reconstruct transcriptional regulatory net-
works, and then integrates those with proteomic data by mapping
them onto a protein interaction network [17]. It uses two mod-
ules—Garnet and Forest, which are designed to run sequentially,
but can also be run individually. Garnet mines transcriptomic and
epigenomic information in order to predict transcription factors
that may be responsible for gene expression changes in the studied
system. Forest maps these transcription factors and protein-level
experimental information onto a PPI. Forest then implements the
Prize-Collecting Steiner Forest algorithm [16] to predict high-
confidence low-density protein interaction pathways that are
important to the studied system (see Fig. 1).
2 Materials
2.1 Finding Transcriptional Regulators with Garnet
1. Transcriptomic data, i.e., differential gene expression between
different conditions in your study (e.g., tumor vs. control).
2. Epigenomic data from a source such as TCGA [1], ENCODE
[2], Roadmap [20], Omics Integrator example data, or experi-
mentally derived epigenomic data (in a BED-formatted file).
3. Transcription factor sequence binding motif predictions, from
a source such as TRANSFAC [21] and/or Neph et al.
[22]. Omics Integrator provides a file derived from the
TRANSFAC database.
2.2 Network Integration with Forest
1. Prize-collecting Steiner tree algorithm executable (msgsteiner
can be downloaded from
http://areeweb.polito.it/ricerca/cmp/code/bpsteiner).
2. Interactome file indicating all known interactions between
proteins. Omics Integrator provides an interactome for
mouse and human proteins derived from iRefIndex [8].
3. Input prize file, indicating the proteins you would like to
include in the final solution (see Note 1).
4. (Optional) Output from Garnet to include transcription factors
implicated by transcriptomic data in the final solution.
5. Cytoscape [25] to visualize the final network solution.
Fig. 1 Outline of the Omics Integrator workflow. Epigenomic data (open chromatin regions or histone marks)
and transcriptomic data are used to predict influential transcription factors (TFs). Transcription factors and
proteomic data are then mapped onto an interactome, and the Prize Collecting Steiner Forest algorithm is used
to produce small pathways and subnetworks predicted to be relevant to the experimental system
3 Methods
3.1 Installation of Omics Integrator
1. You can run Omics Integrator as a web tool on our website:
http://fraenkel.mit.edu/omicsintegrator/ or install it on your
own computer using the instructions at
https://github.com/fraenkel-lab/OmicsIntegrator. You should make
sure you have all dependencies (see Note 2) installed and that you
have the most updated version of Omics Integrator from our GitHub
page (see Note 3).
3.2 Finding Transcriptional Regulators with Garnet
Garnet uses differentially expressed genes from your transcriptomic
assays (e.g., RNA-seq) to predict transcription factors (TFs) that are
likely to be responsible for the altered gene expression. It uses
epigenomic data to identify the regions of the genome in which to
look for differential TF binding. For example, this could be ATAC-seq
data that points out accessible regions of the genome in your cell
type. The algorithm will search for transcription factor binding
motifs within regions implicated by your epigenomic data. The
strength of these motifs is then correlated with the magnitude of
change of nearby differentially expressed genes to give each TF a
score.
1. Obtain epigenomic data for cell lines related to your samples
from one of the sources listed under Subheading 2.1. Alterna-
tively, if you have epigenomic data for your own samples, you
can use these as well. These data can be histone-mark ChIP-seq,
DNase-seq, or ATAC-seq data, all of which indicate accessible
chromatin regions where a TF might be bound. Collect these data
in a BED-formatted file.
2. Go to the Galaxy webserver [23] (see Note 4) to extract the
DNA sequences for your epigenomic regions. Upload your
BED file to Galaxy under the “Get Data” tool, specify which
genome you are using, and then use the “Fetch Alignments/
Sequences”>“Extract Genomic DNA” tool to download a
FASTA-formatted file.
3. Format your experimentally derived gene expression data in a
tab-delimited file with two columns. The first should be the
name of the gene, and the second should be the log-fold-change
of that gene between the study conditions (e.g., tumor
vs. control). We recommend only including genes with a statis-
tically significant change in expression (see Note 5).
4. Create the Garnet configuration file. For an example configu-
ration file, see the README on the Omics Integrator GitHub
page, or the comment on the top of scripts/garnet.py. Your
configuration file should be formatted similarly, but you should
replace the paths to the bedfile, fastafile, and expressionFile with
the paths to the files you created in steps 1–3 in Subheading
3.2. Make sure the annotation files referenced by genefile,
xreffile, and genome are using the correct genome for your
sample (files for mm9 and hg19 are provided with Omics
Integrator).
5. You can change the parameters to your liking (Table 1).
6. Run Garnet on the command line by navigating to the direc-
tory with garnet.py and running
python garnet.py yourconfigfile.cfg
You can also add a --outdir directoryname flag if you would
like to put the output from Garnet into a different directory.
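The two-column expression file described in step 3 can be prepared with a short script. The following is a minimal sketch, assuming your differential expression results are already available as (gene, log fold change, p value) tuples; the function name `write_expression_file`, the tuple layout, the example values, and the 0.05 cutoff applied here (taken from Note 5) are illustrative choices, not part of Omics Integrator itself.

```python
import csv
import os
import tempfile


def write_expression_file(results, path, alpha=0.05):
    """Write significant genes as 'gene<TAB>log fold change' lines.

    results: iterable of (gene, log_fold_change, p_value) tuples
             (an assumed layout, chosen for this sketch).
    Returns the number of genes written.
    """
    written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for gene, log_fc, p_value in results:
            if p_value < alpha:  # keep only significantly changed genes
                writer.writerow([gene, log_fc])
                written += 1
    return written


# Made-up values, for illustration only.
example_results = [
    ("GENE1", 2.1, 0.001),
    ("GENE2", -0.3, 0.40),
    ("GENE3", -1.7, 0.01),
]
out_path = os.path.join(tempfile.mkdtemp(), "expression.tsv")
n_written = write_expression_file(example_results, out_path)
```

The path of the resulting file is what the expressionFile entry of the Garnet configuration file should point to.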
Table 1
An explanation of the parameters used by Garnet
windowsize This parameter determines the maximum distance in nucleotides from a gene TSS to a TF
binding motif for the two to be considered related. Higher values will find more TFs, but their
binding may be farther away from the gene, and thus less likely to be directly related to
expression. Values usually range from 2000 to 20,000
pvalThresh The p value of a correlation measures how likely you are to get this correlation value if the
events were not correlated. This threshold determines which transcription factors will be
passed to Forest. Only those whose correlation p value falls below the provided
threshold will be included. Recommended values range from 0.01 to 0.05. Leave this
value blank to use a q value threshold rather than a p value
qvalThresh A q value is a False Discovery Rate adjusted p value. Using it will result in fewer
false positives. This threshold determines which transcription factors will be passed to
Forest. Only those whose correlation q value falls below the provided threshold
will be included. Recommended values range from 0.01 to 0.05. Leave this value blank if
a p value threshold is sufficient. (If you are going on to run Forest, a p value is generally
sufficient since the network nature of Forest makes false positives less likely to appear in a
final network)
Garnet will run through several steps, informing you on the
command line where it is in the process. These steps include:
• Mapping the genes to nearby epigenetic regions.
• Scanning those regions for TF binding motifs.
• Building a matrix of gene expression changes and binding motif
scores for each TF.
• Running a regression to check the correlation of TF binding
score with differential gene expression.
Garnet will print results into several tab-delimited files. These
files are described in the README file on the Omics Integrator
GitHub page. The file that ends in regression_results.tsv shows all
TFs, clustered by similar binding sites, along with their p values
and q values from the regression. The file that ends in
FOREST_INPUT.tsv contains only significant results and will be used
in future steps.
3.3 Network Integration with Forest
Forest integrates proteomic data and the output from Garnet into a
network. After mapping the data onto a provided interactome
network, it uses the prize-collecting Steiner tree algorithm (solved
by the msgsteiner code that you downloaded and installed) to find
an optimal set of subnetworks. These subnetworks can then be
analyzed for pathway context.
1. If you are not using the default interactome provided with
Omics Integrator, prepare your input interactome file. An
interactome file (or “edge file”) contains the large network of
all known connections between nodes. The file should be
formatted in three tab-delimited columns. Each line should
have the form “interactor1 (tab) interactor2 (tab) weight.”
The third column contains an edge weight, between 0 and
0.99, usually representing the confidence in the validity of
that edge. Optionally, you can include a fourth column indicat-
ing whether that edge is directed (“D”) or undirected (“U”).
The current default interactome for human or mouse tissue is
derived from iRefIndex (version 13) [8] and scored with the
MIScore system [24]. You can find it in the data folder, called
iref_mitab_miscore_2013_interactome.txt. You should create
your own interactome file if you are not running your experi-
ments in mouse or human cell models, or if you have a more
updated interactome for your experiments.
2. Prepare your input prize file. This file contains significant fea-
tures from your proteomic data (see Note 6). It should have
two tab-delimited columns: the protein name (matching the
interactome file exactly), and the protein prize. You should
assign higher prizes to proteins for which you have stronger
evidence that they should be in the final network.
3. Prepare your configuration file. This file contains input
parameters for your run of Forest (Table 2). An example can be
found in the example/a549 folder, called tgfb_forest.cfg. At a
minimum this file must contain values for the parameters w, b,
and D. If you are including results from Garnet, you will also
need a garnetBeta parameter. See Subheading 3.4.1 for more
information.
Table 2
An explanation of the parameters used by Forest
w This parameter influences the number of separate trees detected, which can aid in identifying
functionally distinct processes. Higher values of w lead to more trees in the optimal forest,
while lower values force most prizes to be found in the same tree. Values usually range
from 1 to 10. See Tuncbag et al. [14] for a more detailed explanation
b This parameter linearly scales the prizes, thereby changing the relative weighting of edge
weights and node prizes. Higher values lead to larger trees, including some
low-confidence edges, while lower values force networks to be small and use only high
confidence edges, and lead to the possible exclusion of some prize terminals. Values
usually range from 1 to 20
D This parameter sets the maximum depth from the dummy node, or root of the tree, to the
leaf nodes. Higher values lead to long pathways, while lower values lead to shorter
disparate pathways. Values usually range from 5 to 15
μ This parameter controls negative prizes in Forest. Negative prizes are explained in detail in
Subheading 3.4.2. The default value is zero, and if you want to use negative prizes, values
usually range from 0.0001 to 0.1
garnetBeta This parameter controls the relative weighting of TF scores derived from Garnet and prize
values on proteomic nodes. Higher values will encourage the inclusion of more TF nodes
in the network, while lower values force networks to include only the most significant or
pathway-relevant TF nodes. Typically, the value for this parameter is set to the median
value of the proteomic prizes divided by the median value of the TF scores
4. You can now run Forest with the command
python forest.py -p yourprizefile.txt -e youredgefile.txt -c yourconfigfile.txt --garnet yourgarnetoutput_FOREST_INPUT.tsv
You can also add a --outlabel yourexperimentname flag to give
your output files a prefix and a --outpath directoryname flag if
you would like to put the output from Forest into a different
directory. You may need to add a --msgpath directoryname flag to
indicate where you installed the msgsteiner code during the
installation step. There are several other optional flags you can
add to this command if wanted (see Note 7).
5. Forest will run through several steps, informing you on the
command line where it is in the process. These steps include:
• Reading in your input files.
• Running the msgsteiner optimization.
• Writing the output files.
Output files are described in the README file on the Omics
Integrator GitHub page (see Note 8).
6. To visualize the network output, open the Forest output files in
Cytoscape [25]. Open Cytoscape and import a network. The
Forest output files that end in .sif have been formatted for this
purpose. The file ending in optimalForest.sif contains only
those edges used in the optimal Steiner forest, while augmen-
tedForest.sif contains all edges in the interactome between the
nodes in the final forest, and is recommended for final analysis.
You can then import the nodeattributes.tsv and edgeattributes.tsv
files as tables to annotate those networks with information about
the nodes and edges, such as the edge weights and the node
prizes. Node attributes also include the node prize type: TF,
proteomic, or blank to indicate a hidden node, which had no input
prize but was chosen by the algorithm to connect prize nodes.
Cytoscape has many useful visualization tools that you can use to
better represent these values and types [25, 26] (see Note 9).
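The prize file from step 2 and the name check from Note 1 can be sketched as follows. This is a minimal illustration, assuming your proteomic results are available as (protein, fold change, p value) tuples and your interactome as (interactor1, interactor2, weight) tuples. The function names, the tuple layouts, and the choice of a base-2 logarithm are assumptions for this sketch. The p value cutoff of 0.05 and the use of the absolute log fold change follow Note 6.

```python
import math


def make_prizes(proteins, alpha=0.05):
    """Return {protein: prize}, with prize = |log2(fold change)|.

    proteins: iterable of (name, fold_change, p_value) tuples
              (an assumed layout, chosen for this sketch).
    """
    prizes = {}
    for name, fold_change, p_value in proteins:
        if p_value < alpha:
            # Absolute value: downregulated proteins also get positive prizes.
            prizes[name] = abs(math.log2(fold_change))
    return prizes


def missing_from_interactome(prizes, edges):
    """List prize nodes whose names never appear in the edge list."""
    known = set()
    for node_a, node_b, _weight in edges:
        known.update((node_a, node_b))
    return sorted(name for name in prizes if name not in known)


# Made-up values, for illustration only.
example_proteins = [("P1", 4.0, 0.01), ("P2", 0.5, 0.02), ("P3", 8.0, 0.30)]
example_edges = [("P1", "X", 0.8), ("X", "Y", 0.5)]

prizes = make_prizes(example_proteins)
missing = missing_from_interactome(prizes, example_edges)
# One tab-delimited line per prize node, as described in step 2.
prize_lines = [f"{name}\t{prize}" for name, prize in sorted(prizes.items())]
```

Any names reported as missing should be corrected before running Forest, since prize nodes that are absent from the interactome cannot be connected in the final network.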
3.4 Network Quality Control
1. We recommend checking the robustness and specificity of your
networks. You can do this by adding flags to the forest.py
command. Add --noisyEdges 10 to test robustness of your net-
work to noise in the edge weights. This command will add
Gaussian noise to the edge weights, re-run Forest ten (or your
input number of) times, and then merge the results into output
files with noisyEdges in the filenames. Add --randomTerminals
10 to test specificity of your network to your input terminals.
This command will randomly redistribute your prizes among
the interactome, keeping the degree distribution of your origi-
nal prizes, re-run Forest ten times, and then merge the results
into output files with randomTerminals in the filenames. Both
of these flags will increase the runtime of Forest significantly (see
Note 10).
2. Forest results include an attribute representing the fraction of
optimal forests containing each node, which indicates how
often that node appeared in the various forest runs with noise
or random inputs. A robust network will have high
FractionOfOptimalForestsContaining values for most nodes in the
noisyEdges runs, and nodes that are specific to your input data
will have low FractionOfOptimalForestsContaining values
after the randomTerminals runs. These metrics can be especially
useful ways to judge the importance of hidden nodes to your
system.
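The two criteria above can be combined into a simple filter. The sketch below assumes you have already read the FractionOfOptimalForestsContaining values from the noisyEdges and randomTerminals output into two dictionaries keyed by node name. The function name `robust_and_specific`, the cutoffs of 0.8 and 0.2, and the example values are hypothetical and should be adapted to your own data.

```python
def robust_and_specific(noisy_fraction, random_fraction,
                        min_noisy=0.8, max_random=0.2):
    """Return nodes frequent under edge noise and rare under random prizes.

    noisy_fraction, random_fraction: {node: fraction of optimal forests
    containing the node} for the noisyEdges and randomTerminals runs.
    The two cutoffs are illustrative defaults, not recommended values.
    """
    selected = []
    for node, noisy in noisy_fraction.items():
        # A node absent from the random runs is treated as fraction 0.
        rand = random_fraction.get(node, 0.0)
        if noisy >= min_noisy and rand <= max_random:
            selected.append(node)
    return sorted(selected)


# Made-up values, for illustration only.
noisy = {"NODE_A": 0.9, "NODE_B": 1.0, "NODE_C": 0.3}
random_runs = {"NODE_A": 0.1, "NODE_B": 0.9}
selected = robust_and_specific(noisy, random_runs)
```

In this example NODE_B is robust to noise but also appears in most random runs, so it is not specific to the input data, and NODE_C is specific but not robust.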
3.4.1 Choosing Parameters for Forest
The resulting network from this data integration algorithm is
highly dependent on several parameters. These include w, b, D, μ,
and garnetBeta (Table 2).
We recommend running Forest over a range of these values to
find the best set for your system. To see an example of a script for
testing parameters, see
OmicsIntegrator/example/GBM/GBM_case_study.py. Once you have
several resulting networks, we recommend choosing the best result
as follows:
1. Choose a set of parameters that maximizes the fraction of
input prize nodes that are included in the final network and are
robust to noise (as judged by the noisyEdges runs).
2. Avoid parameters that lead to networks with large “hubs,” that
is, one hidden protein in the middle connected to several prize
nodes with few interactions between these “spokes.” These
hubs are usually not informative or very specific to one system.
We recommend choosing parameters that minimize this effect by
comparing the average degree of hidden nodes in your network
(i.e., the number of edges connecting to those nodes in the
interactome) with the average degree of prize nodes. A
good parameter set will minimize the difference between these
two averages. Figure 2 shows an example of this analysis using
the data in the example/a549 folder.
3. Once conditions 1 and 2 are satisfied, prefer larger net-
works, as those provide the most opportunities for novel dis-
coveries of hidden nodes and pathways enriched in the
subnetworks.
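The selection criteria above can be computed for each parameter set once the corresponding Forest runs have finished. The following sketch is not the GBM_case_study.py script; it only illustrates the bookkeeping, assuming each result is available as a set of node names, together with the input prizes and the interactome degree of each node. The function names and example values are hypothetical, and for simplicity every node without an input prize is treated as a hidden node.

```python
from itertools import product


def parameter_grid(w_values, b_values, d_values):
    """Enumerate every combination of the w, b, and D values to test."""
    return [{"w": w, "b": b, "D": d}
            for w, b, d in product(w_values, b_values, d_values)]


def summarize_run(network_nodes, prizes, degree):
    """Summarize one result by prize coverage and hidden/prize degree gap.

    network_nodes: set of node names in the final network.
    prizes: {node: prize} for the input prize nodes.
    degree: {node: number of edges in the interactome}.
    """
    prize_nodes = [n for n in network_nodes if n in prizes]
    hidden_nodes = [n for n in network_nodes if n not in prizes]

    def mean_degree(nodes):
        return sum(degree[n] for n in nodes) / len(nodes) if nodes else 0.0

    return {
        "prize_fraction": len(prize_nodes) / len(prizes),
        "degree_gap": abs(mean_degree(hidden_nodes) - mean_degree(prize_nodes)),
    }


# Values chosen within the ranges listed in Table 2.
grid = parameter_grid([1, 5], [1, 10], [5, 10])

# Made-up result for one parameter set, for illustration only.
example_prizes = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
example_degree = {"A": 4, "B": 6, "C": 2, "D": 3, "H": 5}
summary = summarize_run({"A", "B", "H"}, example_prizes, example_degree)
```

A parameter set with a high prize_fraction and a small degree_gap corresponds to the kind of network highlighted in Fig. 2.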
Fig. 2 An analysis of several parameter sets when running Forest on the sample A549 data provided with
Omics Integrator. A good parameter set will minimize the difference between the average degree of prize
nodes and hidden nodes, and will include a large number of prize nodes. A good choice of a parameter set is
highlighted by the black arrow. The A549 dataset reflects phosphoproteomic changes in a lung cancer cell line
when stimulated with TGF-beta. The black arrow highlights a network that includes relevant nodes such as
EGFR, while networks with a large average degree of hidden nodes consist mostly of a hub centered on
ubiquitin-C, which connects to most prize nodes in the interactome, but is not specific to the lung cancer cell
system
3.4.2 Negative Prizes in Forest
One of the more innovative aspects of Omics Integrator is its ability
to incorporate negative evidence. There are two settings in which
negative prizes can be useful. First, if you have reason to believe
certain nodes should not show up in your optimal network, you can
assign negative prizes to those nodes and include them in the input
prize file along with positive prizes. Second, negative prizes can
be used to avoid bias toward “hub nodes.”
We have found that in many cases, certain nodes are overrepre-
sented in network integration solutions because they have a high
“degree,” or number of edges connecting to that node, in the
interactome. This could be because they bind with low specificity,
e.g., chaperone proteins, or because they are highly studied pro-
teins, causing more of their interactions to be discovered and
represented in the literature. Because the optimal solution to the
Prize-Collecting Steiner Forest (PCSF) problem is the lowest-cost
way of connecting nodes, it will tend to use these nodes regardless
of the input data. Simply removing these nodes from the network is
not desirable, as there are settings in which they are relevant. To
prevent hubs from being over-represented in all networks, Forest
adds a penalty to nodes based on their degree. This penalty
discourages solutions that include hubs but still allows them to be
present when indicated by the data. This has been shown to improve
accuracy in certain networks [17]. A positive value of the parameter
μ will cause all nodes in the interactome to incur a penalty of
μ*degree.
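To get a feel for the size of this penalty before choosing μ, you can compute μ*degree for every node of your interactome. The sketch below assumes the interactome is available as (interactor1, interactor2, weight) tuples; the function names and example edges are hypothetical, and the code only illustrates the μ*degree term itself, not how Forest combines it with the prizes internally.

```python
from collections import Counter


def node_degrees(edges):
    """Count the number of interactome edges touching each node."""
    degree = Counter()
    for node_a, node_b, _weight in edges:
        degree[node_a] += 1
        degree[node_b] += 1
    return degree


def hub_penalties(edges, mu):
    """Return {node: mu * degree} for every node in the edge list."""
    return {node: mu * deg for node, deg in node_degrees(edges).items()}


# Made-up interactome with one hub "H", for illustration only.
example_interactome = [
    ("H", "A", 0.9),
    ("H", "B", 0.8),
    ("H", "C", 0.7),
    ("A", "B", 0.6),
]
degrees = node_degrees(example_interactome)
penalties = hub_penalties(example_interactome, mu=0.01)
```

Comparing these penalties with the range of your prize values shows how strongly a given μ will discourage high-degree nodes.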
4 Notes
1. Problems in running Omics Integrator can originate from
spaces in node names, or mismatched node names. Input files
to Garnet and Forest should have no spaces in the protein and
gene names. In addition, all node names in the input files
should match those in the interactome exactly. Forest will try
to catch this error by letting you know if a large percentage of
your input nodes were not found in the interactome. The
provided iRefIndex interactome uses Official Gene Symbols
for protein nodes, so when using this interactome, input files
should also use this nomenclature.
2. Currently, Omics Integrator requires Python 2.6 or 2.7, with the
Python packages numpy, scipy, matplotlib, and NetworkX. You
will need Cytoscape (http://www.cytoscape.org) [25, 26] for
viewing network results. Any updates will be reflected in the
“System Requirements” section on our GitHub page (see Note 3).
3. GitHub is an online hosting service for repositories of code. It
lets the community contribute to improvements of open source
projects like Omics Integrator, and keeps track of changes
made and bugs reported. The latest version of Omics Integra-
tor can be found on its GitHub page: https://github.com/
fraenkel-lab/OmicsIntegrator. A new version of Omics Inte-
grator, using Python version 3, is under development
at https://github.com/fraenkel-lab/OmicsIntegrator2.
4. Galaxy is an online platform for computational biologists. In
addition to the Extract Genomic Sequences tool described
here, Galaxy provides several tools and workflows for analyzing
biological data [23].
5. Genes used in Garnet should be significantly differentially
expressed according to your transcriptomic data. For example,
RNA-seq data can be analyzed with tools such as DESeq [27]
or CuffDiff [28]. Genes that these tools report as differentially
expressed with a p value less than 0.05 should be used as the
input to Garnet.
6. Similar to transcriptomic data, your proteomics data will indi-
cate which proteins should be used as the input to Forest. A
review of tools for differential proteomics can be found in
[29]. Many of these tools will provide a metric for determining
statistical significance of differential expression of proteins,
such as a p value. We generally use all proteins with a (modified)
p value of less than 0.05. Prizes for the proteins are then the
absolute value of the log of the fold change of protein expres-
sion. Be sure to use the absolute value, to avoid assigning a
negative prize to downregulated proteins, which would
encourage the algorithm to leave those nodes out of the net-
works, rather than including them.
7. There are several other flags available for advanced users, which
change the behavior of forest.py. For example, you can change
the group of nodes Forest uses to root each resulting tree
(by default, this is all nodes which have been assigned a positive
prize). There is a knockout option for doing an in silico knock-
out experiment by removing a protein from the interactome.
For details on these and other flags, run python forest.py -h or
read our GitHub repository page.
8. Many problems can lead to the final Forest output being empty
(i.e., not containing any nodes). Check the output file ending
in “info.txt” for some statistics of the run. One common
problem, once formatting and input protein name problems
have been ruled out, is a mu parameter set too high or other
Forest parameters that lead to an empty optimal solution. Try
changing your parameter values.
9. Cytoscape is a popular open source software for visualizing and
analyzing networks [25, 26]. It is highly flexible and there are
several available plug-ins for extending its use [30]. Omics
Integrator can output results formatted for import into Cytos-
cape versions 2.8 or 3 by the use of a flag for forest.py
(it defaults to version 3). Once the networks and node and
edge attributes are imported into Cytoscape, you can use
options in Cytoscape to create informative figures of your
results. For example, we often use the Style tab to change the
color of a node to represent its prize, the shape of a node to
represent its Terminal Type (TF vs. proteomic vs. hidden
node), and the edge width to represent its confidence. We
recommend playing around with Styles and Layouts to best
display your network.
10. Depending on your input data and run setup, a run of Omics
Integrator can take a few hours. We recommend running in a
screen session (https://www.gnu.org/software/screen/) or
tmux (https://tmux.github.io/), which will allow the program
to run continuously in the background, or on a computer that
is set not to turn off or interrupt the run. You can also run
Omics Integrator on a cloud server. However, if the run is
taking more than a day, you should cancel the run and look
for errors. In particular, try running Forest without or with a
smaller input to noisyEdges or randomTerminals, as these
options can lead to large memory and time consumption.
High values for the D parameter can also increase runtime.
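As a rough illustration of Note 6, the prize assignment can be sketched in Python. This is a minimal sketch and not part of Omics Integrator: the function name forest_prizes, the tuple layout of the input, the example values, and the choice of a base-2 logarithm are assumptions made for illustration (the note only specifies "the log of the fold change").

```python
import math


def forest_prizes(proteins, p_cutoff=0.05):
    """Return {protein: prize} for proteins passing the p-value cutoff.

    `proteins` is an iterable of (name, fold_change, p_value) tuples.
    The prize is the absolute value of the log fold change, so that
    downregulated proteins also receive a positive prize.
    The log base (2) is an assumption of this sketch.
    """
    prizes = {}
    for name, fold_change, p_value in proteins:
        # Keep only significant proteins with a valid (positive) fold change.
        if p_value < p_cutoff and fold_change > 0:
            prizes[name] = abs(math.log2(fold_change))
    return prizes


# Hypothetical example values: one upregulated, one downregulated,
# and one non-significant protein.
example = [("TP53", 4.0, 0.001), ("EGFR", 0.25, 0.01), ("MYC", 1.5, 0.20)]
prizes = forest_prizes(example)
print(prizes)
```

Both the fourfold upregulated and the fourfold downregulated protein receive the same positive prize, while the non-significant protein is left out.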
Acknowledgments
This work was supported by grants from the National Institutes of
Health (R01-NS089076, T32-GM008334, and
U01-CA184898). We thank Tobias Ehrenberger and Renan
Escalante-Chong for helpful comments on the manuscript.
References
1. Tomczak K, Czerwińska P, Wiznerowicz M (2015) The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol (Pozn) 19:A68–A77. https://doi.org/10.5114/wo.2014.47136
2. Encode Consortium (2013) An integrated encyclopedia of DNA elements in the human genome. Nature 489:57–74. https://doi.org/10.1038/nature11247
3. Malo N, Hanley JA, Cerquozzi S et al (2006) Statistical practice in high-throughput screening data analysis. Nat Biotechnol 24:167–175. https://doi.org/10.1038/nbt1186
4. Huang S-SC, Fraenkel E (2009) Integrating proteomic, transcriptional, and interactome data reveals hidden components of signaling and regulatory networks. Sci Signal 2:ra40. https://doi.org/10.1126/scisignal.2000350
5. Ideker T, Thorsson V, Ranish JA et al (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292:929–934. https://doi.org/10.1126/science.292.5518.929
6. Huang SSC, Clarke DC, Gosline SJC et al (2013) Linking proteomic and transcriptional data through the interactome and epigenome reveals a map of oncogene-induced signaling. PLoS Comput Biol 9(2):e1002887. https://doi.org/10.1371/journal.pcbi.1002887
7. Barabási A-L, Oltvai ZN (2004) Network biology: understanding the cell’s functional organization. Nat Rev Genet 5:101–113. https://doi.org/10.1038/nrg1272
8. Razick S, Magklaras G, Donaldson IM (2008) iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics 9:405. https://doi.org/10.1186/1471-2105-9-405
9. Tyers M, Breitkreutz A, Stark C et al (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34:D535–D539. https://doi.org/10.1093/nar/gkj109
10. Szklarczyk D, Franceschini A, Wyder S et al (2015) STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Res 43:D447–D452. https://doi.org/10.1093/nar/gku1003
11. Wishart DS, Jewison T, Guo AC et al (2013) HMDB 3.0—the human metabolome database in 2013. Nucleic Acids Res 41(Database issue):D801–D807. https://doi.org/10.1093/nar/gks1065
12. Thiele I, Swainston N, Fleming RMT et al (2013) A community-driven global reconstruction of human metabolism. Nat Biotechnol 31:419–425. https://doi.org/10.1038/nbt.2488
13. Kuhn M, Szklarczyk D, Pletscher-Frankild S et al (2014) STITCH 4: integration of protein-chemical interactions with user data. Nucleic Acids Res 42(Database issue):D401–D407. https://doi.org/10.1093/nar/gkt1207
14. Valcárcel B, Würtz P, al Basatena NKS et al (2011) A differential network approach to exploring differences between biological states: an application to prediabetes. PLoS One 6(9):e24702. https://doi.org/10.1371/journal.pone.0024702
15. Kotze HL, Armitage EG, Sharkey KJ et al (2013) A novel untargeted metabolomics correlation-based network analysis incorporating human metabolic reconstructions. BMC Syst Biol 7:107. https://doi.org/10.1186/1752-0509-7-107
16. Tuncbag N, Braunstein A, Pagnani A et al (2013) Simultaneous reconstruction of multiple signaling pathways via the prize-collecting steiner forest problem. J Comput Biol 20:124–136. https://doi.org/10.1089/cmb.2012.0092
17. Tuncbag N, Gosline SJ, Kedaigle AJ et al (2016) Network-based interpretation of diverse high-throughput datasets through the Omics Integrator software package. PLoS Comput Biol 12(4):e1004879
18. Aoki-Kinoshita KF, Kanehisa M (2007) Gene annotation and pathway mapping in KEGG. Methods Mol Biol 396:71–91. https://doi.org/10.1007/978-1-59745-515-2_6
19. Maier T, Güell M, Serrano L (2009) Correlation of mRNA and protein in complex biological samples. FEBS Lett 583:3966–3973. https://doi.org/10.1016/j.febslet.2009.10.036
20. Bernstein BE, Stamatoyannopoulos JA, Costello JF et al (2010) The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol 28:1045–1048. https://doi.org/10.1038/nbt1010-1045
21. Matys V, Kel-Margoulis OV, Fricke E et al (2006) TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34:D108–D110. https://doi.org/10.1093/nar/gkj143
22. Neph S, Vierstra J, Stergachis AB et al (2012) An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489:83–90. https://doi.org/10.1038/nature11212
23. Blankenberg D, Von Kuster G, Coraor N et al (2010) Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol. https://doi.org/10.1002/0471142727.mb1910s89
24. Villaveces JM, Jiménez RC, Porras P et al (2015) Merging and scoring molecular interactions utilising existing community standards: tools, use-cases and a case study. Database 2015:bau131. https://doi.org/10.1093/database/bau131
25. Shannon P, Markiel A, Ozier O et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504. https://doi.org/10.1101/gr.1239303
26. Smoot ME, Ono K, Ruscheinski J et al (2011) Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27:431–432. https://doi.org/10.1093/bioinformatics/btq675
27. Love MI, Anders S, Huber W (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. https://doi.org/10.1186/s13059-014-0550-8
28. Trapnell C, Hendrickson DG, Sauvageau M et al (2013) Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 31:46–53. https://doi.org/10.1038/nbt.2450
29. Bantscheff M, Lemeer S, Savitski MM, Kuster B (2012) Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal Bioanal Chem 404:939–965. https://doi.org/10.1007/s00216-012-6203-4
30. Saito R, Smoot ME, Ono K et al (2012) A travel guide to Cytoscape plugins. Nat Methods 9:1069–1076. https://doi.org/10.1038/nmeth.2212
Chapter 3
Analyzing DNA Methylation Patterns During Tumor Evolution

Heng Pan and Olivier Elemento
Abstract
Epigenetic modifications play a key role in cellular development and tumorigenesis. Recent large-scale
genomic studies have shown that mutations in players of the epigenetic machinery and concomitant
perturbation of epigenomic patterning are frequent events in tumors. Among epigenetic marks, DNA
methylation is one of the best studied. Hyper- and hypo-methylation events of specific regulatory elements
(such as promoters and enhancers) are sometimes thought to be correlated with expression of nearby genes.
High-throughput bisulfite converted sequencing is currently the technology of choice for studying DNA
methylation at base-pair resolution and on a whole-genome scale. Such broad and high-resolution coverage
investigations of the epigenome provide unprecedented opportunities to analyze DNA methylation pat-
terns, which are correlated with tumorigenesis, tumor evolution, and tumor progression. However, few
computational pipelines are available to the public to perform systematic DNA methylation analysis.
Utilizing open source tools, we here describe a comprehensive computational methodology to thoroughly
analyze DNA methylation patterns during tumor evolution based on bisulfite converted sequencing data,
including intra-tumor methylation heterogeneity.
Key words DNA methylation, ERRBS, DMRs, Intra-tumor methylation heterogeneity
1 Introduction
Epigenetic modification plays a key role in the regulation of all
DNA-based processes including transcription, DNA repair, and
replication, which are fundamental to tumorigenesis [1]. Recent
large-scale genomic studies have shown that mutations in the epi-
genetic machinery and concomitant perturbation of epigenomic
patterning are frequent events in tumors, such as B-cell lympho-
mas, leukemia, and prostate cancers [2–5]. DNA methylation is one
of the best-studied epigenetic markers. DNA methylation is char-
acterized by the attachment of a methyl group to carbon 5 of
cytosines, principally in the context of CpG dinucleotides. Hyper-
or hypo-methylation of genomic regions (for example promoters or
enhancers) can lead to repression or activation of the expression of
nearby genes. Several examples of promoter methylation levels that
are inversely correlated with gene expression levels have been
identified [6]. A number of tumor suppressor genes are silenced by
promoter hypermethylation [7]. Thus, identifying differentially
methylated cytosines (DMCs) and differentially methylated regions
(DMRs), especially those perturbed in tumors, has become a cen-
tral objective in cancer methylome analysis.
The advent of high-throughput DNA sequencing technologies
has provided new opportunities to study DNA methylation, allow-
ing for fast, single-base resolution scans in targeted or enriched
regions or at whole-genome scale. Large-scale sequencing projects
have generated hundreds of methylation profiles of tumors of
different origins and at different stages. In turn, markers based on
methylation usually provide important information about cellular
phenotypes in healthy and diseased tissues. In many cases, assessing
methylation profiles enabled improved patient stratification over
other approaches based on transcriptomics or mutation profiles
[4, 8].
Enhanced reduced representation bisulfite sequencing
(ERRBS) is a powerful high-throughput sequencing platform,
which can provide high sequencing depth and coverage of millions
of CpGs in the human genome [9]. To date, few computational
pipelines for analyzing bisulfite sequencing exist, even though such
data are increasingly widely used. Here, we describe a comprehen-
sive DNA methylation profiling analysis based on ERRBS data. This
method is also applicable to reduced representation bisulfite
sequencing (RRBS) data or whole-genome bisulfite sequencing
(WGBS) data [10].
Tumors evolve following a Darwinian process, in which cells
continuously acquire mutations that alter their fitness. The fittest
cancer cells may divide faster, and will be more likely to survive
inhibitory signals from the microenvironment than other less fit
cells. Those cells are therefore more likely to expand in abundance
within a tumor. Initiation of anti-cancer treatment can alter the
fitness landscapes within tumors and frequently leads to selection
and growth of cells that have acquired resistance mutations.
Because every tumor potentially evolves along a different trajectory
as a result of distinct environments and exposures to treatment,
tumor evolution introduces individual features into each tumor
[11]. While the contribution of genetic mechanisms to tumor
evolution is well documented, the contribution of epigenetic
mechanisms to tumor evolution has only recently begun to be
studied [4, 12]. In this chapter, we specifically focus our analyses
on capturing and analyzing DNA methylation patterns during
tumor evolution.
Following generation of an ERRBS dataset, a typical analysis
workflow consists of first identifying DMCs and DMRs, followed
by correlating those regions to biologically relevant genes and
pathways. Other analyses relevant to tumor evolution may include
quantifying intra-tumor methylation heterogeneity (MH). Indeed,
compared to the genetic code, DNA methylation is more flexible
and cells within a cell population may have distinct methylation
patterns. Thus a tumor cell population may harbor varying levels
of intra-tumor MH. Such heterogeneity is emerging as a powerful
predictor of tumor evolution, progression, and relapse [4].
The analysis of DNA methylation patterns during tumor evo-
lution requires ERRBS samples from different tumor development
stages, or diagnosis-relapse sample pairs from several patients.
ERRBS samples from normal healthy tissue (which can be used as a
baseline) can add an additional layer of information to the analysis.
The computational analysis scheme is outlined in Fig. 1. Sample
preparation and high-throughput sequencing are described by Akalin et al.
[9]. Sequencing reads from every stage of tumor evolution are
mapped independently to a bisulfite converted reference genome,
generating separate DNA methylation profiles. DNA methylation
status for every single CpG is determined in each sample, separately.
Next, DMC and DMR calling is performed between the normal
sample and each tumor stage or between any two tumor stages.
Intra-tumor MH analysis is performed as a separate analysis on the
same samples. Finally, identified DMCs/DMRs and MH hotspots
are analyzed for enriched gene functions, in order to unravel path-
ways relevant to tumor evolution.
2 Materials
2.1 Software
1. ERRBS data quality control: FastQC (available at http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) [13].
2. Adaptive quality and adapter trimming: Trim Galore (available at http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) [14].
3. Bisulfite converted sequence reads mapping and cytosine methylation states calling: Bismark (available at http://www.bioinformatics.babraham.ac.uk/projects/bismark/) [15].
4. DMCs and DMRs calling: RRBSseeqer (available at http://icb.med.cornell.edu/wiki/index.php/Elementolab/) [4].
5. Region annotations: ChIPseeqer (available at http://icb.med.cornell.edu/wiki/index.php/Elementolab/) [16].
6. Pathway analysis: iPAGE (available at http://icb.med.cornell.edu/wiki/index.php/Elementolab/) [17].
7. ERRBS data analyzing tools: Errbs-tools including methylCall_from_Bismark.py, regionMethyl.R and regionMH.R (available at https://github.com/SpursHeng90/errbs-tools/) [4].
[Fig. 1 flowchart: FASTQ files are checked with FastQC (FastQC report) and trimmed with Trim Galore (trimmed FASTQ files); Bismark produces BAM files and Errbs-tools produces Methyl files; Errbs-tools, RRBSseeqer, and ChIPseeqer yield MH hotspots, DMCs/DMRs, regulatory region methylation, and gene lists, which feed global MH analysis, pathway analysis, unsupervised analysis, and supervised analysis.]
Fig. 1 Schematic of the ERRBS data analysis pipeline. This comprehensive computational pipeline starts from the
ERRBS FASTQ files. The first step is to use FastQC to perform a quality check of the ERRBS data and make sure the
data quality is good enough for downstream analysis. Second, Trim Galore is used to remove adapter
contamination. Third, Bismark is used to map reads to bisulfite converted genomes and to generate methyl files,
which indicate the methylation status of each CpG site in the genome. Next, several computational tools including
Errbs-tools, ChIPseeqer, and RRBSseeqer are used to perform downstream analysis. MH hotspots can be used
to perform global MH analysis and link MH to tumor evolution and disease progression. Individual DMCs/DMRs
can be annotated to the nearest genes and such gene lists can be used to perform pathway analysis. Methylation
levels for regulatory regions are good inputs for both supervised and unsupervised types of downstream
analysis.
2.2 Input Files
1. For fastqc in FastQC: FASTQ format files of normal or tumor
samples are the most common inputs; BAM or SAM format files
are also acceptable.
2. For trim_galore in Trim Galore: FASTQ format files of healthy
tissue or tumor samples are required.
3. For bismark_genome_preparation in Bismark: FASTA
format files of the genome reference are required.
4. For bismark in Bismark: FASTQ format files processed with
trim_galore are required. FASTA format files are also accept-
able but not recommended since the quality values are missing
for such types of data.
5. For bismark_methylation_extractor in Bismark: BAM files
from bismark are used as inputs.
6. For methylCall_from_Bismark.py in Errbs-tools: CpG_OT_sample.RRBS_trimmed.1bp.fq_bismark.txt and
CpG_OB_sample.RRBS_trimmed.1bp.fq_bismark.txt from bismark_methylation_extractor
are used as input files. Reads in file set
1 (labeled with OT) reflect methylation levels of CpGs in the
forward strand. Reads in file set 2 (labeled with OB) contain
methylation information of CpGs in the reverse strand.
7. For epicore2calls.pl in RRBSseeqer: Methyl files from
methylCall_from_Bismark.py are used as input files (see Table 1).
8. For RRBSseeqer_CG in RRBSseeqer: output files from
epicore2calls.pl are used as inputs.
9. For RRBSidentifyUpDownDMR.pl in RRBSseeqer: output
files with DMCs information from RRBSseeqer_CG are used
as input files.
10. For ChIPseeqerAnnotate, mergeCSAnnotateGenesColumns.pl,
make_PAGE_input.pl and page.pl in ChIPseeqer:
files with DMR information from RRBSidentifyUpDownDMR.pl
are used as inputs. These four tools are run sequentially, and each
uses the output files of the previous one.
11. For regionMethyl.R in Errbs-tools: two kinds of input files are
required. One is the Methyl file from methylCall_from_Bismark.py
(see Table 1); the other is an RDS format file containing
genomic region annotations as GRanges or GRangesList
objects [18, 19]. RDS is a special R based format, which can
store a single R object.
12. For regionMH.R in Errbs-tools: three types of input files are
required. The first is the Methyl file from
methylCall_from_Bismark.py (see Table 1). The second is the BAM
file from bismark, which is a binary format for storing
sequence data. BAM is a more space-saving format than
SAM. The last is the RDS format file
containing the genomic region annotations as GRanges
or GRangesList objects [18, 19].
Table 1
RRBSseeqer input file example
chrBase chr Base Strand Coverage freqC freqT
chr1.10542 chr1 10542 F 587 99.83 0.17
chr1.10636 chr1 10636 F 57 85.96 14.04
chr1.10617 chr1 10617 F 58 100 0
chr1.10589 chr1 10589 F 58 100 0
chr1.10631 chr1 10631 F 56 100 0
chr1.10638 chr1 10638 F 57 85.96 14.04
chr1.10609 chr1 10609 F 58 98.28 1.72
chr1.10620 chr1 10620 F 59 93.22 6.78
chr1.10525 chr1 10525 F 609 95.24 4.76
chr1.10497 chr1 10497 F 606 97.85 2.15
chr1.10633 chr1 10633 F 58 89.66 10.34
chr1.133181 chr1 133181 R 118 61.02 38.98
chr1.133218 chr1 133218 R 117 54.7 45.3
chr1.133180 chr1 133180 F 131 40.46 59.54
chr1.133165 chr1 133165 F 136 88.24 11.76
chr1.135028 chr1 135028 F 168 88.1 11.9
chr1.135203 chr1 135203 R 77 87.01 12.99
chr1.135208 chr1 135208 R 77 90.91 9.09
chr1.135173 chr1 135173 R 78 87.18 12.82
chr1.134999 chr1 134999 F 170 31.18 68.82
chr1.135191 chr1 135191 R 76 94.74 5.26
chr1.135179 chr1 135179 R 79 67.09 32.91
chr1.135031 chr1 135031 F 168 90.48 9.52
chr1.135218 chr1 135218 R 71 78.87 21.13
chr1.136911 chr1 136911 F 103 92.23 7.77
chr1.136913 chr1 136913 F 101 94.06 5.94
chr1.136895 chr1 136895 F 104 67.31 32.69
chr1.136876 chr1 136876 F 104 95.19 4.81
chr1.136925 chr1 136925 F 103 0.97 99.03
chr1.137120 chr1 137120 F 29 96.55 3.45
chr1.137157 chr1 137157 F 29 100 0
chr1.137169 chr1 137169 F 28 0 100
chr1.139059 chr1 139059 F 13 0 100
chr1.139029 chr1 139029 F 13 61.54 38.46
chr1.139073 chr1 139073 F 13 61.54 38.46
chr1.237094 chr1 237094 F 99 7.07 92.93
chr1.249382 chr1 249382 R 29 0 100
chr1.249429 chr1 249429 R 29 93.1 6.9
chr1.531247 chr1 531247 R 29 100 0
chr1.531265 chr1 531265 R 29 100 0
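To make the layout of Table 1 concrete, the following Python sketch parses one row of such a Methyl file. It is a minimal illustration rather than part of Errbs-tools or RRBSseeqer: the names MethylRecord, parse_methyl_line, and methylated_reads are hypothetical, the file is assumed to be whitespace-delimited, and freqC and freqT are assumed to be the percentages of reads showing C (methylated) and T (unmethylated) at the site.

```python
from collections import namedtuple

# One row of the Methyl file shown in Table 1 (field names are our own).
MethylRecord = namedtuple(
    "MethylRecord", "chr_base chrom base strand coverage freq_c freq_t"
)


def parse_methyl_line(line):
    """Parse one whitespace-delimited data row (not the header row)."""
    fields = line.split()
    return MethylRecord(
        chr_base=fields[0],
        chrom=fields[1],
        base=int(fields[2]),
        strand=fields[3],          # "F" or "R" in Table 1
        coverage=int(fields[4]),   # number of reads covering the CpG
        freq_c=float(fields[5]),   # assumed: % of reads with C
        freq_t=float(fields[6]),   # assumed: % of reads with T
    )


def methylated_reads(record):
    """Approximate count of reads supporting methylation at this site."""
    return round(record.coverage * record.freq_c / 100.0)


rec = parse_methyl_line("chr1.10636 chr1 10636 F 57 85.96 14.04")
print(rec.chrom, rec.base, rec.coverage, methylated_reads(rec))
```

Because freqC and freqT are rounded percentages, the read count recovered this way is an approximation of the underlying count.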
3 Methods
3.1 Pre-alignment Quality Control and Data Cleaning Processes
Similar to other high-throughput sequencing technologies,
ERRBS is prone to systematic errors and artifacts such as PCR
duplicates, GC-content shifts, and adapter contamination. In
addition to these common problems, ERRBS data can suffer more
critical problems such as erroneous methylation status of cytosines
introduced by end-repair and low bisulfite conversion rates. Thus,
it is always a good idea to perform a simple quality control analysis
to avoid any biases that may affect subsequent analyses. ERRBS
generates output in FASTQ format, like most high-throughput
sequencing assays. We have successfully used a publicly available
tool, FastQC, to perform short read quality control (see Fig. 1).
Several features are very important to pay attention to. Those include
per base sequence quality, per sequence quality scores, per base
sequence content, and adapter content. Details about how to
interpret the FastQC results are available in Andrews et al. (see Note
1) [13].
Tools such as FastQC may reveal a variety of artifacts including
adapter contamination. Adapter contamination is one of the most
important technical issues for next-generation sequencing data, in
that it may affect read mapping, leading to low mapping efficiencies
and may even result in incorrect mapping and/or unreliable meth-
ylation calling in ERRBS. Moreover, positions filled in during
end-repair can introduce artificial methylation readouts in
ERRBS. The restriction endonuclease MspI selects relatively small
fragment sizes (usually between 40 and 220 bp, but with quite a
few MspI-MspI fragments even shorter than 40 bp). This can
become a problem especially for sequencing reads with longer
lengths. If the read length is longer than the MspI-MspI fragment
size, there is a higher chance that the sequencing read would
contain the adapter sequence on the 3′ end. To address this problem,
adapters from longer reads need to be trimmed. We have
successfully used Trim Galore to perform adapter trimming in our
pipelines (see Fig. 1).
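The read-through problem described above can be illustrated with a short Python sketch. This is a deliberately simplified, exact-match illustration and not the algorithm used by Trim Galore; the function name trim_3prime_adapter, the min_overlap parameter, the 34 bp fragment, and the 13-base adapter prefix are assumptions chosen for the example.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove the adapter, or a leading part of it, from the 3' end of a read."""
    # Case 1: the complete adapter occurs in the read; cut from there on.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: only the start of the adapter fits at the very end of the read.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    # No adapter found: return the read unchanged.
    return read


ADAPTER = "AGATCGGAAGAGC"                         # assumed adapter prefix
fragment = "CGGATTTAGGTTACGTTTCGGAATTTAGGTTCCG"   # hypothetical 34 bp fragment

# A 50 bp read of a 34 bp fragment runs into the adapter.
long_read = (fragment + ADAPTER + "GGTTCAG")[:50]
# A 40 bp read of the same fragment contains only the adapter start.
short_read = (fragment + ADAPTER)[:40]

print(trim_3prime_adapter(long_read, ADAPTER))
print(trim_3prime_adapter(short_read, ADAPTER))
```

In both cases the trimmed read equals the original fragment, whereas a read shorter than the fragment would be returned unchanged.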
Altogether the pre-alignment process follows these three steps:
1. Quality check for original ERRBS reads with FastQC:
$ fastqc [-o output dir] [-f fastq|bam|sam] seqfiles1 . . . seqfilesN
The -o parameter specifies the directory where all outputs from
FastQC should be stored. The -f parameter indicates the input file
format, usually FASTQ. “seqfiles1 . . . seqfilesN”
indicates that multiple input FASTQ files can be analyzed.
2. Adapter trimming with Trim Galore:
$ trim_galore [options] <filenames>
To run this analysis, specify any optional parameters and then
list your input FASTQ files after the options. The most relevant
options are --rrbs and --adapter. --rrbs specifies that the
data is an MspI digested library and --adapter specifies the
adapter sequence to be trimmed from reads.
3. Quality check for adapter trimmed ERRBS reads with FastQC:
Perform the same analysis as in step 1, but use trimmed
FASTQ files instead.
Example:
Taking one of our FASTQ files as an example (DLBCL_1D.ERRBS.fq),
the actual commands are:
1. $ fastqc -o . -f fastq DLBCL_1D.ERRBS.fq
2. $ trim_galore --rrbs --adapter TGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGC --output_dir . DLBCL_1D.ERRBS.fq
3. $ fastqc -o . -f fastq DLBCL_1D.ERRBS_trimmed.fq
In our examples, we assume all of the samples and files to be
analyzed are placed in the current working directory. If not, please
specify the exact path of the files instead of using
DLBCL_1D.ERRBS.fq directly. In step 2, the sequence we used is the standard
adapter for Illumina. If no sequence is supplied, trim_galore will
attempt to auto-detect the adapter that has been used (it checks
for the standard Illumina, Small RNA, and Nextera adapters).
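For users who prefer to script the three pre-alignment steps, the commands above can be assembled in Python. This is a minimal sketch under stated assumptions: the helper name prealignment_commands is hypothetical, only the flags shown in the example commands are used, and the trimmed file is assumed to be named by replacing the ".fq" suffix with "_trimmed.fq", as in the example.

```python
import os
import shlex


def prealignment_commands(fastq, adapter=None, out_dir="."):
    """Build the FastQC / Trim Galore / FastQC command lists for one file."""
    base = os.path.basename(fastq)
    if base.endswith(".fq"):
        base = base[:-3]
    # Assumed naming of the Trim Galore output, following the example above.
    trimmed = os.path.join(out_dir, base + "_trimmed.fq")

    trim = ["trim_galore", "--rrbs"]
    if adapter:
        # Without --adapter, trim_galore attempts adapter auto-detection.
        trim += ["--adapter", adapter]
    trim += ["--output_dir", out_dir, fastq]

    return [
        ["fastqc", "-o", out_dir, "-f", "fastq", fastq],    # step 1
        trim,                                               # step 2
        ["fastqc", "-o", out_dir, "-f", "fastq", trimmed],  # step 3
    ]


for cmd in prealignment_commands("DLBCL_1D.ERRBS.fq"):
    print(shlex.join(cmd))
```

The sketch only prints the commands; each list could be passed to subprocess.run(cmd, check=True) to execute the step, provided the tools are installed and on the PATH.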
In general, per base sequence quality provided by FastQC should
be higher than the cutoff (20 in most cases). Also, average per sequence
quality should be around 38 (Phred+33, 0-41 scale). The other
important restriction is that there should be no overrepresented
sequences or adapter content, which can be an indication of potential
adapter contamination. Ideally, your dataset should pass most quality
control steps after adapter trimming (hence step 3 above) to confirm
the data quality for downstream analysis (see Note 1). Also, besides
the two parameters in step 2 we mentioned, there are several items
worth consideration during quality control, the details of which are
provided in Subheading 4 (see Notes 2 and 3).
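The Phred+33 encoding mentioned above can be decoded in a few lines of Python. This is a minimal sketch for a single read, whereas FastQC summarizes qualities across all reads; the function names and the example quality string are hypothetical, and the cutoff of 20 is the value quoted in the text.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string: score = ASCII code minus 33."""
    return [ord(ch) - 33 for ch in quality_string]


def mean_quality(quality_string):
    """Average quality score of one read."""
    scores = phred33_scores(quality_string)
    return sum(scores) / len(scores)


def bases_below_cutoff(quality_string, cutoff=20):
    """0-based positions whose quality score is below the cutoff."""
    scores = phred33_scores(quality_string)
    return [i for i, q in enumerate(scores) if q < cutoff]


# Hypothetical quality string: good bases followed by a poor 3' end.
qual = "IIIIIIIIIIGGGGG5+#"
print(phred33_scores(qual))
print(round(mean_quality(qual), 2))
print(bases_below_cutoff(qual))
```

In this example only the last two bases fall below the cutoff, which is the typical pattern that quality trimming is meant to remove.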
3.2 ERRBS Reads Alignment
Mapping ERRBS reads to a bisulfite converted genome presents
many computational challenges. Alignments should allow for
mismatches, especially at potential methylation sites. Also, alignments
should be unique, considering the numerous possible combinations
of methylation statuses in each read, to avoid miscalling of
methylation levels. Among all the publicly available mapping tools
such as BSMAP, RMAP-bs, MAQ, or BS seeker, we have chosen
Bismark [15] to map ERRBS reads due to a couple of substantial
advantages (see Note 4) [20–23].
The alignment process requires two steps:
1. Bisulfite converted genome preparation: typically no parameter
changes are required for the genome preparation process. The
only thing that absolutely needs to be specified is the directory
where genome references are located. Such files need to be in
FASTA format and can be downloaded from public
databases such as UCSC genome browser or Ensembl
[24, 25]. A recent genome build is recommended, e.g., hg19
or GRCh38.
$ bismark_genome_preparation [options] <path_to_genome_folder>
<path_to_genome_folder> specifies the directory of the genome
reference. Using --bowtie1 will create bisulfite indexes for
Bowtie 1 instead of Bowtie 2 (default).
2. ERRBS reads alignment: for this step, the path to the bisulfite
converted genome directory and the adapter trimmed FASTQ
files are needed and the alignment can be run with default
parameters.
$ bismark [options] <genome_folder> {-1 <mates1> -2 <mates2> | <singles>}
<genome_folder> specifies the directory of the bisulfite converted
genome reference from the first step. When paired-end
data are used as input, -1 <mates1> -2 <mates2> is used to
indicate each FASTQ file. If single-end data are used here,
<singles> is used to specify the ERRBS reads file in FASTQ
format. Several parameters are often used in options. For example,
--bowtie1 indicates bismark will use Bowtie 1 instead of
Bowtie 2. -l specifies the “seed length,” which is the number of
bases at the high quality end of the read to which the mismatch
ceiling applies. Typically, the length of sequencing
reads can be used for this parameter. --multicore sets the
number of parallel instances of bismark to be run concurrently.
--output_dir specifies the directory into which all output files
(in BAM format) are written.
Example:
Using one of our trimmed FASTQ files as an example
(DLBCL_1D.ERRBS_trimmed.fq), the actual commands are:
1. $ bismark_genome_preparation --bowtie1 genome/
2. $ bismark --bowtie1 -l 50 --multicore 6 genome/ --output_dir . DLBCL_1D.ERRBS_trimmed.fq
In our examples, we assume that all sample files (.fq) and
reference genomes are placed in the current working directory. If
not, the exact path of the files needs to be specified instead of using
genome/ and DLBCL_1D.ERRBS_trimmed.fq directly. Additional
information regarding other optional parameters in the second
step (Bismark alignment), such as usage of Bowtie 1 or Bowtie
2, as well as directional or non-directional sequencing can be found
in Subheading 4 (see Note 5) [26, 27].
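The idea behind aligning to a bisulfite converted genome can be sketched in Python. This is a conceptual toy example and not Bismark's implementation: it handles only the forward strand and exact matches, and the function names and the nine-base "genome" are hypothetical. Both the reference and the read are converted in silico (C to T) before matching, and methylation is then read off by comparing the original read with the original reference.

```python
def bisulfite_convert(sequence):
    """Return the fully C-to-T and fully G-to-A converted sequences."""
    seq = sequence.upper()
    return {"C_to_T": seq.replace("C", "T"), "G_to_A": seq.replace("G", "A")}


def call_methylation(genome, read):
    """Locate a forward-strand read and call each reference cytosine.

    Returns a list of (0-based genome position, is_methylated) tuples,
    or None if the converted read does not match the converted genome.
    """
    genome, read = genome.upper(), read.upper()
    converted_genome = bisulfite_convert(genome)["C_to_T"]
    converted_read = bisulfite_convert(read)["C_to_T"]
    start = converted_genome.find(converted_read)
    if start == -1:
        return None
    calls = []
    for offset, (ref_base, read_base) in enumerate(zip(genome[start:], read)):
        if ref_base == "C":
            # C retained in the read: methylated; read as T: unmethylated.
            calls.append((start + offset, read_base == "C"))
    return calls


genome = "ACGTCCGGA"   # toy reference
read = "GTTCGG"        # from GTCCGG: first C converted, second C protected
print(bisulfite_convert(genome))
print(call_methylation(genome, read))
```

Because conversion collapses C and T, distinct genomic locations can become identical after conversion, which is one reason why unique alignments matter for reliable methylation calls.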
3.3 Cytosine Methylation State Calling
Once suitable ERRBS alignments are generated, the methylation
level for individual sites (mostly CpG sites) can be determined. To
be consistent with our alignment processes, we utilize a simple
script, named bismark_methylation_extractor from Bismark to
achieve this goal. After methylation levels are generated, we need
to perform quality checks to assess the accuracy of individual CpG
methylation levels. Then we convert the data into a more user-friendly
format for further analysis. This process consists of the following
steps:
1. Extract CpG methylation levels from BAM files: we use
bismark_methylation_extractor from Bismark to extract CpG
methylation levels from each read in the BAM files. This tool
is one of the most important advantages of Bismark compared
to other computational tools (see Note 4).
$ bismark_methylation_extractor [options] <genome_folder> <filenames>
<genome_folder> specifies the directory of the bisulfite converted
genome reference (outputs from
bismark_genome_preparation). <filenames> specifies the
BAM files from bismark alignment (outputs from bismark
alignment). The -s option indicates that single-end sequence
read data was used. --multicore sets the number of parallel
instances to be run concurrently. --output_dir specifies the
directory to which all output files are exported. Using default
options, bismark_methylation_extractor will give six different
output files. Those include two types of possible strand-specific
methylation information, original top strand reads
(OT) and original bottom strand reads (OB), within three
different contexts (CpG, CHG, CHH). Details about those
files can be found in Subheading 4 (see Note 6). Each file has
methylation information of a single-cytosine group, for example,
methylation information for cytosines in CpG context
from OT reads. Those files are tab delimited with 1-based
coordinates:
<seq-ID> <methylation state> <chromosome> <start position (= end position)> <methylation call>
Each line in these files represents single CpG site information
from a single read. The second column specifies the methylation
state: “+” and “-” indicate methylated and unmethylated
status, respectively. Also, the fifth column has methylation call
information. “Z” and “z” indicate methylated and unmethylated
CpGs. Meanwhile, “X” and “x” represent methylated and
unmethylated CHGs and “H” and “h” represent methylated
and unmethylated CHH context. In our analysis, we only focus
on cytosines in CpG dinucleotide context and we use
--merge_non_CpG to merge non-CpG methylation (CHG context
and CHH context) results into one file. Reads in files labeled
with OT reflect methylation levels of CpGs in the forward
strand and reads in files labeled with OB contain methylation
information of CpGs in the reverse strand.
2. Quality check for CpG methylation extraction: after extracting
methylation information from each read, we check
several features to make sure that the alignment and methylation
calls are correct and ready for downstream analysis. All the
features below can be found in the report files from the Bismark
alignment (see Table 2).
• It is important to check that all the methylation calls
in the OT and OB files are “Z” or “z”, to make sure that all
the information is collected from the CpG context.
• One of the most important features is the average conversion
rate, which measures how successfully the bisulfite treatment
converted unmethylated cytosines. This number should be
very close to 100% (see Table 2).
Table 2
Bismark output statistics example

ID  #reads    Mapping     Average          #CpGs    Average   Average CpG
              efficiency  conversion rate  (10)     coverage  methylation levels (%)
1   73293324  64.50%      99.8979          2848900  51.82     34.40%
2   76101498  66.00%      99.8947          2933770  50.14     31.50%
3   85482964  66.50%      99.8838          2822016  56.84     39.50%
4   80272372  66.50%      99.8138          2795046  56.97     35.90%
5   64361288  66.60%      99.7396          2625874  50.8      34.50%
6   78897537  66.90%      99.8999          2782944  56.52     33.80%
7   76009876  66.00%      99.8751          2850695  52.66     38.20%
8   75431630  65.90%      99.778           2893682  52.82     38.40%
9   76653335  65.20%      99.8868          2843240  54.8      39.00%
10  78481481  65.00%      99.8859          2808703  54.04     38.50%
11  73287618  67.60%      99.8617          2733715  52.24     37.20%
12  73847281  67.50%      99.7764          2920328  50.99     36.50%
13  81504963  66.30%      99.8747          2822336  58.31     41.90%
14  96892822  62.50%      99.8899          2795175  59.89     48.80%
15  43137414  64.20%      99.8866          1609370  48.55     30.80%
16  72217478  66.50%      99.7681          2741285  54.08     45.90%
17  72922475  66.90%      99.8376          2872522  53.3      38.20%
18  75434628  66.80%      99.8349          2839250  52.93     38.30%
19  73437411  66.30%      99.8747          2767916  52.07     39.10%
20  82936367  68.00%      99.8547          2823987  52.74     36.10%
• Samples should have acceptable mapping efficiency (uniquely
mapped reads out of all input sequenced reads after adapter
trimming; this should be 60% or higher). Mapping efficiency
typically decreases with increasing sequencing read
length (see Table 2).
• Samples analyzed together should ideally have similar numbers
of covered CpGs, similar average coverage across the
whole dataset, and similar average CpG methylation levels in
each category/group (see Table 2).
3. Call CpG methylation level for each CpG: after the quality check,
we collect and combine the methylation status of
all reads overlapping each individual CpG into an overall
methylation level. We created a Python script to automate this
process.
$ python methylCall_from_Bismark.py [options] <sample_name> <input_dir> <output_dir>
<sample_name> sets the name of the healthy tissue or tumor
sample to be analyzed. This script can perform methylation
calls for all the samples in the targeted directory. <input_dir>
specifies where the input files are located. <output_dir> specifies
the directory into which all output Methyl files are written (see
Table 1). -c is the only parameter that the user needs to adjust
here; it specifies the minimum coverage required per CpG site.
4. Data transformation for RRBSseeqer: RRBSseeqer requires
a special format for its input data files. We use a Perl script to
convert Methyl files into a format accepted by RRBSseeqer.
$ perl epicore2calls.pl <input_file> | gzip > <output_file>
<input_file> sets the name of the Methyl file from step 3 (see
Table 1). <output_file> sets the name of the gzip-compressed
file to which the converted calls are written.
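As an illustration only, the aggregation described in steps 1 and 3 can be sketched in Python. This is not the methylCall_from_Bismark.py script itself: the function name call_cpg_methylation and the returned dictionary layout are hypothetical, and the sketch assumes the five-column, tab-delimited per-read format described in step 1.

```python
from collections import defaultdict


def call_cpg_methylation(lines, min_coverage=10):
    """Combine per-read CpG calls into one methylation level per CpG site.

    Each input line is assumed to hold five tab-separated fields:
    read ID, methylation state (+/-), chromosome, 1-based position,
    and methylation call ("Z" = methylated CpG, "z" = unmethylated CpG).
    """
    counts = defaultdict(lambda: [0, 0])  # (chrom, pos) -> [methylated, total]
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip header or malformed lines
        _read_id, _state, chrom, pos, call = fields
        if call not in ("Z", "z"):
            continue  # ignore non-CpG contexts
        site = counts[(chrom, int(pos))]
        if call == "Z":
            site[0] += 1
        site[1] += 1
    # Keep only sites that reach the minimum coverage (the -c cutoff in step 3).
    return {
        site: (meth / total, total)
        for site, (meth, total) in sorted(counts.items())
        if total >= min_coverage
    }


# Hypothetical toy input: three reads cover chr1:100, one read covers chr1:250.
demo_lines = [
    "read1\t+\tchr1\t100\tZ",
    "read2\t-\tchr1\t100\tz",
    "read3\t+\tchr1\t100\tZ",
    "read4\t+\tchr1\t250\tZ",
]
levels = call_cpg_methylation(demo_lines, min_coverage=2)
print(levels)
```

With a cutoff of 2 reads, only chr1:100 is retained in this toy input; chr1:250 is dropped because a single read covers it.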
Example:
Using one of our BAM files as an example
(DLBCL_1D.ERRBS_trimmed.fq_bismark.bam), the commands are as
follows:
1. $ bismark_methylation_extractor -s --output . --merge_non_CpG --multicore 6 --genome_folder genome/ DLBCL_1D.ERRBS_trimmed.fq_bismark.bam
2. $ python methylCall_from_Bismark.py -c 10 DLBCL_1D bismark_output/ cpg/
3. $ perl epicore2calls.pl cpg.DLBCL_1D.mincov10.txt | gzip > cpg.DLBCL_1D.mincov10.txt.calls.gz
As before, all files, directories, and samples are assumed to be in
the current working directory. If the files are in a different
location, the full path to each file and directory must be
specified. When working with non-directional ERRBS
data, additional parameters are required, as indicated in Subheading
4 (see Notes 6 and 7). In the above example, the minimum coverage
per CpG was set to 10, because a methylation level is only reliable
when it is supported by enough reads. For ERRBS analysis, 10 is
commonly used as the cutoff, a tradeoff between the number of
CpGs available for analysis and sequencing cost. Suggestions
on how to choose this parameter can be found in Subheading 4 (see Note 8).
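To make the effect of the coverage cutoff concrete, the following sketch computes the binomial standard error of a methylation level at different read depths. It is a simplification that assumes reads are independent draws from the cell population; the function name methylation_standard_error is hypothetical and is not part of any tool described here.

```python
import math


def methylation_standard_error(level, coverage):
    """Binomial standard error of a methylation level estimated from `coverage` reads."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return math.sqrt(level * (1.0 - level) / coverage)


# Uncertainty of a half-methylated CpG at three read depths.
for n_reads in (3, 10, 30):
    se = methylation_standard_error(0.5, n_reads)
    print(f"coverage {n_reads:>2}: standard error {se:.3f}")
```

Under this assumption, a CpG with a true methylation level of 50% has a standard error of roughly 0.29 at 3 reads and roughly 0.16 at 10 reads, which illustrates why a minimum coverage is enforced before calling methylation levels.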
3.4 Patient-Specific DMRs Analysis (Unsupervised)
The identification of DMCs and DMRs is an important component
of DNA methylation analysis. DMCs and DMRs typically reflect
local DNA methylation changes during tumor evolution, such as
those occurring in tumors between diagnosis and relapse. Several
methods enable discovery of consistently hyper- or hypo-methylated
DMCs or DMRs across several samples. DMCs and DMRs can be
defined based on groups of samples or between two samples from
the same patient. We focus our analysis here on defining DMCs and
DMRs on a patient-specific basis. We use our in-house tool,
RRBSseeqer, to extract DMCs and DMRs from individual patient
data. This analysis is unsupervised in that any region of the genome
can be a DMR.
To investigate cancer progression and tumor evolution, one
may want to compare diagnosis and relapse samples from the
same cancer patient and analyze DMCs and DMRs between these
sample pairs. To examine the functional role of these methylation
changes, one may ask whether DMCs and DMRs are near genes
belonging to specific pathways, which might be relevant for cancer
biology. These types of analyses frequently include identification of
gene sets and pathways, which are over-represented within the
DMR-associated genes in each patient. To explore commonalities
across patients, one can perform unsupervised analyses of over-
represented pathways (Fig. 2a). In more detail, such analyses con-
sist of the following steps:
1. Identify DMCs: DMCs are identified by comparing two samples,
for example healthy tissue and tumor, or diagnosis and
relapse from the same patient. We identify DMCs using Fisher's
exact test or the chi-square test, comparing the fractions of
methylated to total reads at individual CpGs. We use a default
false discovery rate = 20% for this analysis. Data formatting
scripts are used to create tab-delimited output files.
$ RRBSseeqer_CG -rrbs1 <control_file> -rrbs2 <experiment_file> | sort_column.pl | sort_column_alpnum.pl > <output_file>
<control_file> after -rrbs1 represents the baseline sample (for
example, the diagnosis sample); <experiment_file> after -rrbs2
represents the second sample to compare to baseline (for example,
the relapse sample). These files should be in the format
produced by epicore2calls.pl above. <output_file> sets the name
of the output file containing an analysis of each CpG.
2. Identify DMRs: we have defined DMRs as regions containing
at least five DMCs separated by less than 250 bp, and whose
average methylation difference (including non-DMCs in the
region) is more than 10%. We use a Perl script called
RRBSidentifyUpDownDMR.pl to identify DMRs based on DMCs.
[Figure 2. Panel (a): heatmap of pathways (rows) across patients (columns), colored by enrichment or depletion. Panel (b): heatmap of regulatory elements (rows) across diagnosis and relapse patient samples (columns), colored by row-wise z-score of methylation level.]
Fig. 2 Examples of DMR identification and visualization. (a) Pathways overrepresented among hypermethylated
genes (promoters overlapping hypermethylated DMRs) of individual patients.
Each row represents a single pathway and each column represents a patient pair. (b) Each row represents a
single differentially methylated regulatory element. Each column represents a single diagnosis or relapse sample
from a patient. Scale bars represent z-scores of methylation levels. Values were centered and scaled in the row
direction
$ perl RRBSidentifyUpDownDMR.pl --metfile=<input_file> --outfile=<output_file> [options]
<input_file> is a tab-delimited file with the CpG comparison from
RRBSseeqer_CG (step 1). <output_file> is a file containing
DMR information, with a single DMR
per row including chromosome, start position, end position,
size, number of CpGs in the DMR, and methylation difference.
Additional options can be specified. --dmax specifies the largest
distance between two DMCs (Default: 250). --minmetdx specifies
the minimum average DNA methylation difference for
DMRs (Default: 0.1). --minnumcg specifies the minimum number
of DMCs needed to define a DMR (Default: 5).
3. Annotate DMRs with nearest genes: we use ChIPseeqerAnnotate
from the ChIPseeqer package to annotate DMRs with
the closest genes and to identify the genomic regions (exons,
introns, promoters, intergenic regions) where gene and DMR may
overlap.
$ ChIPseeqerAnnotate --peakfile=<input_file> [options]
<input_file> represents a DMR file produced at the previous
step. Options include --genome, which specifies the genome
reference to use (hg19, etc.), and --db, which specifies the gene
annotation version (RefSeq, etc.). This step produces several
output files. We use the file with the .genes.annotated.txt suffix.
Each row in this file indicates whether a gene overlaps with DMRs
in different genomic regions such as promoters, exons, introns,
etc.
4. Data transformation for iPAGE: DMR-associated genes are
converted into a format compatible with the iPAGE pathway
analysis tool. This step is likewise performed using tools
from the ChIPseeqer package. First, we run
mergeCSAnnotateGenesColumns.pl to extract specific columns from
the .genes.annotated.txt file and retrieve the genes with peaks in
their promoters/exons/introns, etc. Next, a Perl script,
make_PAGE_input.pl, is used to convert the data into the input
format accepted by iPAGE.
$ mergeCSAnnotateGenesColumns.pl --genefile=<input_file> --outfile=<output_file> [options]
<input_file> points to the output from ChIPseeqerAnnotate,
specifically the file that ends with .genes.annotated.txt.
<output_file> defines the output file. Options include
--geneparts, which specifies which gene parts overlapping with
DMRs should be used for downstream analysis: P (Promoter),
I (Intron), E (Exon), etc., and combinations of these (separated
by commas) can be used here. --showORF specifies whether
gene id (1) or transcript id (0) should be used in the output.
$ make_PAGE_input.pl --geneslist=<input_file> [options]
This tool creates an input file for iPAGE. Briefly, each gene is
labeled as “gene of interest” (a gene near a DMR) or
“background”. <input_file> indicates the output file from
mergeCSAnnotateGenesColumns.pl, which should be used as
input for this step. The --refgene parameter specifies the gene
annotation used by ChIPseeqer and is used to create the
background gene category.
5. Pathway analysis of DMR-related genes: given a gene profile
with genes labeled either as genes of interest or as background,
iPAGE is used to run pathway analysis against known pathways
and gene sets. It uses mutual information to connect input
gene sets with published gene sets and pathways.
$ page.pl --expfile=<input_file> [options]
<input_file> indicates the input file from the last step. The
--pathways option in iPAGE defines the database of pathways
to use. Gene Ontology (GO), the Lymphoid Gene signatures,
and many other databases are supported [28, 29]. It is also
feasible to use custom-defined pathways. The output of
iPAGE indicates over- or under-representation of the input
gene sets within specific gene sets or pathways, with log10
p-values from the hypergeometric distribution as pathway
enrichment scores.
6. Unsupervised analysis of over-represented pathways within
DMR-related genes: we use the R package pheatmap
[18, 30]. The input for this analysis is a matrix in which each
row represents a pathway and each column represents a single
sample pair (for example, a diagnosis-relapse pair). An entry in
the matrix is 1 if the pathway is significantly
enriched in that patient, and 0 otherwise.
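The DMC test in step 1 can be illustrated with a small, self-contained Python sketch. It is not the RRBSseeqer_CG implementation: the function names are hypothetical, the two-sided Fisher's exact test is written out directly from the hypergeometric distribution, and the Benjamini-Hochberg procedure is assumed as the false discovery rate correction because the text does not name the method.

```python
from math import comb


def fisher_exact_two_sided(meth1, total1, meth2, total2):
    """Two-sided Fisher's exact p-value for methylated/total read counts in two samples."""
    all_meth = meth1 + meth2
    all_reads = total1 + total2
    denom = comb(all_reads, total1)

    def prob(k):
        # Probability that k of the methylated reads belong to sample 1.
        return comb(all_meth, k) * comb(all_reads - all_meth, total1 - k) / denom

    lowest = max(0, total1 - (all_reads - all_meth))
    highest = min(total1, all_meth)
    observed = prob(meth1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(
        p for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical counts: site -> (methylated, total) in diagnosis, then in relapse.
sites = {
    ("chr1", 100): (2, 20, 18, 20),
    ("chr1", 150): (10, 20, 10, 20),
    ("chr1", 300): (3, 4, 1, 4),
}
pvals = [fisher_exact_two_sided(*counts) for counts in sites.values()]
qvals = benjamini_hochberg(pvals)
dmcs = [site for site, q in zip(sites, qvals) if q <= 0.20]  # FDR = 20%
print(dmcs)
```

In this toy input only chr1:100, where methylation rises from 2 of 20 reads to 18 of 20 reads, passes the 20% false discovery rate threshold.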
Example:
Here, we provide an example in which we analyze DMCs and
DMRs between diagnosis and relapse tumor samples (DLBCL).
Identifying DMCs and DMRs with the strategy outlined above
yields tumor evolution-related DMCs/DMRs. Subsequently, we can
perform downstream analyses investigating, for example, how those
DMRs appear or disappear during tumor progression. We use
cpg.DLBCL_1D.mincov10.txt.calls.gz as our <control_file> and
cpg.DLBCL_1R.mincov10.txt.calls.gz as our <experiment_file>.
Examples of commands for calling DMCs and
DMRs, and subsequently annotating them, are:
1. $ RRBSseeqer_CG -rrbs1 cpg.DLBCL_1D.mincov10.txt.calls.gz -rrbs2 cpg.DLBCL_1R.mincov10.txt.calls.gz -test chi | sort_column.pl 1 | sort_column_alpnum.pl 0 > DMC.DLBCL_1.txt
2. $ perl RRBSidentifyUpDownDMR.pl --metfile=DMC.DLBCL_1.txt --dmax=250 --minmetdx=0.1 --minnumcg=5 --outfile=DMR.DLBCL_1.txt
3. $ ChIPseeqerAnnotate --peakfile=DMR.DLBCL_1.txt --genome=hg19 --db=refSeq
4. $ mergeCSAnnotateGenesColumns.pl --genefile=DMR.DLBCL_1.txt.refSeq.GP.genes.annotated.txt --geneparts=P --showORF=1 --outfile=DMR.DLBCL_1.pro.txt
$ make_PAGE_input.pl --geneslist=DMR.DLBCL_1.pro.txt --refgene=/data/hg19/refSeq
5. $ page.pl --expfile=DMR.DLBCL_1.pro.txt.ORF.txt --pathways=human_go_orf --cattypes=P,C,F -suffix=GO
6. > pheatmap(mat, ...)
In step 3, ChIPseeqerAnnotate is used to annotate DMRs
based on hg19 human genome and RefSeq gene annotations. In
step 4, we specifically extract DMRs overlapping with promoters
(defined as 2 kb windows centered on RefSeq transcription start
site). In step 5, we run pathway analysis against known Biological
Processes (BP) in the Gene Ontology [28]. Other pathway data-
bases such as KEGG pathways or msigDB pathways can be used in
this step [31–33]. Step 6 is different from other commands we
used in this chapter. It is an R command and needs to be run in R
environment. The 1-0 matrix mat needs to be provided to retrieve
the heatmaps. Generally, several options can be used in this func-
tion to modify heatmaps in R. For example, scale option is a
character indicating if the values should be centered and scaled in
either the row or the column direction, or none. cluster_rows and
cluster_cols are boolean values determining if rows/columns
should be clustered.
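The DMR definition used in step 2 of this subheading (at least five DMCs separated by less than 250 bp, with an average methylation difference above 10% across all CpGs in the region) can be sketched as follows. This is a simplified illustration, not the RRBSidentifyUpDownDMR.pl script: the function name find_dmrs and the input layout are hypothetical, and hyper- and hypomethylated DMCs are not separated here.

```python
def find_dmrs(cpgs, dmax=250, min_num_cg=5, min_met_dx=0.1):
    """Group DMCs on one chromosome into DMRs.

    `cpgs` is a position-sorted list of (position, methylation_difference, is_dmc)
    tuples covering every tested CpG, including non-DMCs.
    """
    # Chain consecutive DMCs whose spacing is below dmax into candidate runs.
    runs, current = [], []
    for pos, _diff, is_dmc in cpgs:
        if not is_dmc:
            continue
        if current and pos - current[-1] >= dmax:
            runs.append(current)
            current = []
        current.append(pos)
    if current:
        runs.append(current)

    dmrs = []
    for run in runs:
        if len(run) < min_num_cg:
            continue
        start, end = run[0], run[-1]
        # The average difference includes non-DMC CpGs inside the region.
        diffs = [diff for pos, diff, _is_dmc in cpgs if start <= pos <= end]
        mean_diff = sum(diffs) / len(diffs)
        if abs(mean_diff) > min_met_dx:
            dmrs.append((start, end, len(run), mean_diff))
    return dmrs


# Hypothetical toy data: five clustered DMCs, one non-DMC inside, one distant DMC.
toy_cpgs = [
    (100, 0.3, True), (150, 0.3, True), (200, 0.3, True), (220, 0.0, False),
    (260, 0.3, True), (300, 0.3, True), (1000, 0.4, True),
]
toy_dmrs = find_dmrs(toy_cpgs)
print(toy_dmrs)
```

The isolated DMC at position 1000 does not form a DMR on its own, while the cluster from 100 to 300 is reported with an average difference that is diluted by the non-DMC at position 220.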
3.5 Genomic Region-Specific DMRs Analysis (Supervised)
The analysis in Subheading 3.4 is currently limited to pairwise
sample analysis. While it can be extended to more than two samples,
an alternative approach is to compare the methylation levels of
specific regions across two groups of samples. Groups of samples
can be defined based on clinical variables such as diagnosis and
relapse, chemo-resistant versus chemo-refractory, etc. Genomic
regions can be defined as promoters, CpG islands, enhancers, and
binding sites for certain proteins, e.g., CTCF [34]. The proposed
analysis identifies which of these predefined genomic regions are
differentially methylated between the two sample groups.
We created an R script to collect methylation levels for specified
regions and then perform supervised analysis between the two
groups in R [18]. The analysis applies statistical testing followed
by correction for multiple testing to assess differential methylation.
The analysis usually takes two steps:
1. We use an R script named regionMethyl.R in Errbs-tools to
generate methylation levels for the promoters of each patient. The
methylation level for each region in each sample is calculated by
averaging the methylation levels of all CpGs inside the
corresponding region, provided the region contains a minimum
number of CpGs. This script generates a matrix with methylation
levels for all the regions across all the samples.
$ R CMD BATCH --no-save --no-restore [options] regionMethyl.R regionMethyl.log
We use R CMD BATCH to run the R script from the
command line. The regionMethyl.R script can be found in
Errbs-tools. regionMethyl.log stores the running log for the
script. --no-save specifies that nothing will be saved in the
.RData file. --no-restore specifies that R does not read the
.RData file in the current directory. We use those two arguments
because objects with identical names in the current R workspace
could cause bugs in the program, and the outputs of the
script could change the user's workspace. Several
arguments need to be specified in this step. --input_dir specifies
where the Methyl files are located (see Table 1). --output_dir
specifies where the output files should be created. --regions
specifies an RDS format file that contains genomic
region coordinates as GRanges or GRangesList objects
[18, 19].
2. Perform supervised analysis comparing methylation levels of
input regions between groups: we use paired t-tests or
Wilcoxon tests between diagnosis and relapse sample pairs.
This analysis is followed by correction for multiple testing. A
minimum methylation difference (at least 10%) is also often
required to ensure the biological relevance of any significant
change. Differentially methylated promoters
across all the patients can be visualized using a heatmap
(Fig. 2b). As before, we use the pheatmap R package [18, 30].
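The region-level averaging described in step 1 can be sketched in Python as follows. This is an illustration rather than the regionMethyl.R script: the function name region_methylation, the data layout, and the default of three CpGs per region are assumptions, since the text does not state the threshold that is actually used.

```python
def region_methylation(cpg_levels, regions, min_cpgs=3):
    """Average CpG methylation per region; regions with too few CpGs get None.

    `cpg_levels` maps (chromosome, position) to a methylation level in [0, 1].
    `regions` maps a region name to (chromosome, start, end), 1-based inclusive.
    """
    result = {}
    for name, (chrom, start, end) in regions.items():
        inside = [
            level
            for (cpg_chrom, pos), level in cpg_levels.items()
            if cpg_chrom == chrom and start <= pos <= end
        ]
        if inside and len(inside) >= min_cpgs:
            result[name] = sum(inside) / len(inside)
        else:
            result[name] = None  # not enough covered CpGs in this region
    return result


# Hypothetical toy data for two promoters on chr1.
toy_levels = {
    ("chr1", 100): 0.2,
    ("chr1", 150): 0.4,
    ("chr1", 180): 0.6,
    ("chr1", 5000): 0.9,
}
toy_regions = {
    "geneA_promoter": ("chr1", 1, 2000),
    "geneB_promoter": ("chr1", 4000, 6000),
}
region_levels = region_methylation(toy_levels, toy_regions)
print(region_levels)
```

Running this for every sample and stacking the results gives a regions-by-samples matrix of the kind that step 2 tests for differential methylation.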
Example:
1. $ R CMD BATCH --no-save --no-restore '--args input_dir=cpg/ output_dir=. regions=promoter22kb.rds' regionMethyl.R regionMethyl.log
2. > pheatmap(mat, ...)
In this example, the regions parameter specifies a list of
promoters (defined as 2 kb windows centered on the RefSeq
transcription start site) in GRanges data format [19]. Other regions
such as CpG islands and enhancers can also be used here.
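The row-wise centering and scaling applied before plotting the heatmap (as in Fig. 2b) can be written out explicitly. The sketch below is a Python illustration of that transformation, not a replacement for pheatmap: the function name row_zscores is hypothetical, the sample standard deviation (n - 1 denominator) is used, and constant rows are set to zero here as a convenience choice.

```python
from statistics import mean, stdev


def row_zscores(matrix):
    """Center and scale each row of a regions-by-samples matrix."""
    scaled = []
    for row in matrix:
        mu = mean(row)
        sd = stdev(row)  # sample standard deviation (n - 1 denominator)
        if sd == 0:
            scaled.append([0.0 for _ in row])  # constant row: no variation to show
        else:
            scaled.append([(value - mu) / sd for value in row])
    return scaled


# Hypothetical methylation levels: two promoters (rows) in three samples (columns).
toy_matrix = [
    [0.2, 0.4, 0.6],
    [0.5, 0.5, 0.5],
]
z = row_zscores(toy_matrix)
print(z)
```

After this transformation each row has mean zero, so the heatmap colors reflect how a sample deviates from the other samples for the same region rather than the absolute methylation level.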
3.6 Intra-tumor MH Characterization
It is common to use DMRs or average DNA methylation levels
within a specified region to characterize DNA methylation. However,
DNA methylation is generally measured on cell populations
consisting of hundreds of thousands or even millions of cells.
Bisulfite-converted DNA reads may contain multiple CpGs, thus
enabling a per-read analysis of DNA methylation patterns. In such
analyses, loci with identical average DNA methylation levels can
have distinct DNA methylation patterns between samples. Such
distinct patterns are the result of intra-tumor MH (see Fig. 3a).
Intra-tumor MH has been connected to gene expression levels and
to clinical outcomes in several types of tumors, including chronic
lymphocytic leukemia (CLL) and diffuse large B-cell lymphoma
(DLBCL) [4, 12]. For example, higher MH in the promoter of
certain genes was linked to lower expression of those same genes in
CLL. Patients with higher global MH levels at the diagnosis stage of
DLBCL have a higher chance of relapse after chemotherapy [4].
The concept of epipolymorphism has been used to describe and
quantify methylation heterogeneity [35]. The epipolymorphism
level of a four-CpG locus (reads containing four or more contiguous
CpGs) was defined as the probability that epialleles randomly
sampled from the locus differ from one another. Higher
epipolymorphism corresponds to higher intra-tumor MH and vice versa.
Epipolymorphism can be analyzed at individual loci based on
ERRBS data. A global epipolymorphism level can also be calculated.
We perform MH analysis using the following steps:
1. Epipolymorphism is calculated for each locus in a sample. The
epipolymorphism level of a 4-CpG locus in the cell population
is defined as the probability that epialleles randomly sampled
from the locus differ from one another. More specifically, if
$p_i$ denotes the fraction of DNA methylation pattern
$i$ in the cell population, the epipolymorphism equals
$1 - \sum_i p_i^2$.
The higher the epipolymorphism, the higher the intra-tumor
heterogeneity. We created a script called regionMH.R to
evaluate epipolymorphism levels (the script can be found in
Errbs-tools). We use R CMD BATCH to run the R script
from the command line:
$ R CMD BATCH --no-save --no-restore [options] regionMH.R regionMH.log
[Figure 3. Panel (b): epipolymorphism (0.0 to 1.0) plotted against DNA methylation (%), with regions of high and low intra-tumor heterogeneity marked. Panel (c): intra-tumor heterogeneity at diagnosis versus relapse for loci in promoters (p = 0.002414).]
Fig. 3 Examples of intra-tumor MH analysis. (a) Epipolymorphism levels are dependent on DNA methylation
levels. All loci are divided into different groups based on their methylation level and the median epipolymorphism
of each group is calculated. Genome-wide intra-tumor MH is quantified by the area under the median line. (b)
Median epipolymorphism lines for diagnosis and relapse tumors from patient 1.1 in our cohort. Intra-tumor MH
visibly decreased with tumor evolution. (c) Relapsed samples displayed significantly lower intra-tumor MH. All
the loci are located in gene promoters
Several arguments need to be specified in this analysis.
--cpg_dir sets the location of the CpG Methyl files (see Table 1),
and --bam_dir sets the directory containing the BAM files created by
Bismark. Output files are written to --output_dir. --regions
points to promoter regions (defined as 2 kb windows centered
on the RefSeq transcription start site) in GRanges format [19].
2. Epipolymorphism is correlated with the methylation level:
a locus has a lower expected epipolymorphism
when it is fully methylated or fully unmethylated than
a locus with a 50% methylation level (see Fig. 3b).
Therefore, global epipolymorphism must be normalized by
methylation level. Our global analysis divides loci into
bins according to their methylation levels, and the median
epipolymorphism is calculated for each bin. The area under
the median line is defined as the MH for each patient (see Fig. 3b,
c). With this analysis, we can study the correlation between
MH and tumor evolution.
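The epipolymorphism definition in step 1 translates directly into a few lines of Python. The sketch below is illustrative and is not the regionMH.R script: the function name epipolymorphism is hypothetical, and each read is assumed to be summarized as a four-character string of methylation calls at the locus (for example "ZZzz").

```python
from collections import Counter


def epipolymorphism(patterns):
    """Probability that two randomly sampled epialleles at a locus differ.

    Computed as 1 minus the sum of squared pattern frequencies.
    """
    counts = Counter(patterns)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one read pattern is required")
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# Two hypothetical loci with the same average methylation (50%)
# but different read-level patterns.
bimodal_locus = ["ZZZZ", "ZZZZ", "zzzz", "zzzz"]
mixed_locus = ["ZZzz", "zzZZ", "ZzZz", "zZzZ"]
print(epipolymorphism(bimodal_locus))
print(epipolymorphism(mixed_locus))
```

Both toy loci are 50% methylated on average, yet the locus with four distinct patterns has the higher epipolymorphism (0.75 versus 0.5), which is the kind of difference that average methylation levels cannot capture.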
Example:
1. $ R CMD BATCH --no-save --no-restore '--args sample=DLBCL_1 cpg_dir=cpg/ bam_dir=bam/ output_dir=. regions=promoter22kb.rds' regionMH.R regionMH.log
A growing number of studies have shown that intra-tumor MH
is predictive of clinical outcome and tumor evolution. Such studies
suggest that tumors with higher MH progress faster and earlier
than tumors with low MH. It is, however,
worth noting that there are several methods for calculating
intra-tumor MH (see Note 9).
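The global MH measure described in step 2 of this subheading (bin loci by methylation level, take the median epipolymorphism per bin, then compute the area under the median line) can be sketched as follows. This is an illustration only: the function name global_mh, the choice of ten equal-width bins, the use of bin centers as x coordinates, and the trapezoid rule are all assumptions that the text does not specify.

```python
from statistics import median


def global_mh(loci, n_bins=10):
    """Area under the median epipolymorphism line across methylation bins.

    `loci` is an iterable of (methylation_percent, epipolymorphism) pairs,
    with methylation in the range 0 to 100.
    """
    width = 100.0 / n_bins
    bins = [[] for _ in range(n_bins)]
    for meth, epi in loci:
        index = min(int(meth // width), n_bins - 1)  # 100% falls in the last bin
        bins[index].append(epi)
    # One point per non-empty bin: (bin center, median epipolymorphism).
    points = [
        ((i + 0.5) * width, median(values))
        for i, values in enumerate(bins)
        if values
    ]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid rule
    return area


# Hypothetical loci, using two bins to keep the arithmetic easy to follow.
toy_loci = [(10, 0.2), (20, 0.4), (80, 0.5)]
toy_area = global_mh(toy_loci, n_bins=2)
print(toy_area)
```

In this toy case the two bin medians are 0.3 (at 25% methylation) and 0.5 (at 75% methylation), so the area under the line between them is 50 × 0.4 = 20.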
3.7 Conclusion and Outlook
Epigenetic modifications play a key role in cell development and
tumorigenesis. DNA methylation is one of the best studied
epigenetic modifications. High-throughput bisulfite sequencing
technology provides great opportunities to analyze DNA
methylation patterns during various physiological and
pathophysiological processes. DNA methylation is relevant for cancer
biology and has been linked to tumor evolution. Here we describe a
comprehensive computational methodology to analyze DNA methylation,
utilizing open source tools and our own in-house software. Our
methodology starts from pre-alignment quality control and data
cleaning processes, followed by data alignment, methylation state
calling, and multiple downstream analyses. Following our
directions, users can perform supervised and unsupervised analyses at
different scales, including base pair DMCs, patient-specific DMRs,
and genomic region-specific DMRs. Using the above-mentioned
tools to identify DNA methylation abnormalities makes it
possible to link them to cellular development, tumor progression,
and tumor evolution.
It is still unclear how DNA methylation or epigenetic modifications
contribute to genetic changes and subsequently influence
tumor evolution. Computationally, faster and more accurate
alignment is still needed to perform larger-scale and more reliable
analyses. Moreover, it will be equally important to design new
algorithms that identify DMCs/DMRs with a lower false discovery
rate. There are still a number of unanswered questions in the field
of tumor methylation analysis. For example, promoter
hypermethylation could so far only be correlated with lower gene
expression levels in a subset of genes. The global correlation between
promoter hypermethylation and gene expression is still weak,
making it difficult to draw any mechanistic conclusions from
methylation patterns. In contrast to genetic perturbation, which
has been successful, modifying DNA methylation at specific regions
is still a major challenge both in vivo and in vitro. Long-term
efforts are needed for
this intriguing but complex study of tumor evolution.
4 Notes
1. FastQC performs a series of quality control analyses including
per base sequence quality, per sequence quality scores, per base
sequence content, and adapter content. Each test is flagged with a
pass (green tick), warning (orange exclamation mark), or fail
(red cross). The assigned status depends on how much a sample
deviates from good quality benchmark samples. Examples of
good and bad quality samples can be found at
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [13].
2. When running trim_galore, several parameters need to be
considered. First, the default value of -q/--quality <INT> is
20, which means that low-quality read ends (Phred score under 20)
are trimmed. This value is acceptable for ERRBS analysis but
can be made less stringent if more reads are needed for the analysis.
Second, the sequencing platform needs to be factored in. By
default, trim_galore interprets quality scores as ASCII+33 Phred
scores (option --phred33). ASCII+33 quality scores are usually
used by Illumina 1.8+, which encodes Phred quality scores
from 0 to 41 using ASCII 33 to 74. If the sequencer did not
use ASCII+33 quality scores, use the --phred64 option to specify
the alternative encoding. Third, when no adapter sequence is
provided, trim_galore analyzes the first one million
sequences of the first specified file and attempts to find the
first 12 or 13 bp of the following standard adapters:
Illumina: AGATCGGAAGAGC
Small RNA: TGGAATTCTCGG
Nextera: CTGTCTCTTATA
If using other adapters, it is important to provide the correct
adapter sequences in this step.
3. One of the most important parameters for trim_galore is
-s/--stringency <INT>, which specifies the minimum number of
bases that must overlap with the adapter sequence. The default
value (1) is very stringent, since even a single base of overlap
with the adapter sequence is trimmed. If a less stringent (higher)
value is used here, there is a higher chance of carrying too much
adapter contamination into the downstream analysis, thus distorting
the results. However, with a very stringent cutoff (such as
the default value), some bases may be removed by mistake
because the last base of a read is identical to the adapter by
chance. If the sequencing data do not have enough reads after
adapter trimming, CpG coverage may be too low and downstream
analyses such as DMC/DMR calling may be difficult.
4. Several computational tools are available for aligning
bisulfite-converted data such as ERRBS. Before Bismark,
different groups developed
analysis tools for bisulfite-converted data, including BSMAP,
RMAP-bs, MAQ, and BS Seeker [20–23]. BS Seeker outperformed
the other mapping programs mentioned above
(BSMAP, RMAP-bs, and MAQ) in terms of mapping efficiency,
accuracy, and required CPU time [23]. Although the principles
underlying BS Seeker and Bismark are similar, Bismark offers a
number of advantages over BS Seeker [15]. For example, Bismark
supports single-end and paired-end data, variable read
length, adjustable insert size, and more adjustable mapping
parameters. Bismark is much faster than BS Seeker. Bismark
also supports non-directional libraries directly. The most
important advantage compared to other tools is that Bismark
not only performs read mapping but also provides tools for CpG
methylation calling, an important feature for the analysis of
ERRBS-type data. For these reasons, Bismark is the most convenient
tool available and accordingly the most widely used. However, BS
Seeker and Picard (https://broadinstitute.github.io/picard/)
are also good alternatives.
5. When mapping reads to the genome, one should be aware of
the sequencing library context. Directional sequencing libraries
are common. However, if the library is not directional, the
--non_directional parameter should be used. Also, according to
the Bismark tutorial, Bowtie 1 instead of Bowtie 2 should be
used when a faster alignment is needed or when the sequencing
reads are short [15]. Bowtie 1 usually performs as well
as Bowtie 2 under such conditions. However, when applied to
libraries with long fragment size (75 bp or above), Bowtie 2 is
recommended and shows better performance.
6. In Bismark, there are four kinds of output files storing
methylation status, labeled OT, OB, CTOT, and
CTOB. Those files comprise reads that are versions of the
original top strand, the original bottom strand, strands
complementary to OT, and strands complementary to OB,
respectively. If a library is directional, only reads that are
versions of OT and OB will be sequenced. Forward strand CpG
methylation information can be retrieved from OT files, and reverse
strand CpG information can be extracted from OB files. However,
if libraries are constructed in a non-directional manner, all
four different strands will end up in the sequencing
library with roughly the same likelihood. In this case, it is
important to extract forward strand CpG methylation information
from both OT and CTOT files. OB and CTOB files collect
reverse strand DNA methylation information.
7. If the applied sequencing library is non-directional, it is important
to specify this when trying to use the bismark_methylation_
extractor.
8. Methylation information extracted from low-coverage CpGs is
typically considered unreliable because of potential sequencing
errors at specific base pairs during routine ERRBS data
processing. The original ERRBS method paper recommends removing
CpGs covered by fewer than ten reads [9]. However,
this parameter can be decreased when analyzing
WGBS data, since WGBS coverage is routinely low (10–15). In fact,
considering that some people use 3 as the minimum number of reads
in WGBS data [36], lower coverage may be acceptable.
9. Many other types of analyses can be used to characterize
MH. Those include M-scores, Eloci, and PDR [8, 12,
37]. All these features have positive correlations with MH but
they show differences in other respects. For example, M-scores
only capture MH in each CpG site, which ignores the hetero-
geneity relation between adjacent CpGs. Eloci and PDR can-
not consider hyper/hypo direction when they estimate MH
changes. It is recommended to use different methods to study
MH and explore any potential differences between methods.
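The quality score encodings mentioned in Note 2 can be illustrated with a short Python sketch that decodes a FASTQ quality string into Phred scores. The function name phred_scores is hypothetical; the sketch only shows the ASCII offset arithmetic and does not reproduce the trimming algorithm used by trim_galore.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores.

    offset=33 corresponds to --phred33 and offset=64 to --phred64.
    """
    return [ord(char) - offset for char in quality_string]


# "!" is ASCII 33, "5" is ASCII 53, and "J" is ASCII 74.
print(phred_scores("!5J"))             # ASCII+33 encoding
print(phred_scores("@Th", offset=64))  # ASCII+64 encoding
```

With the ASCII+33 encoding, the characters "!" and "J" decode to the scores 0 and 41, the two ends of the range given in Note 2, and "5" decodes to 20, the default quality cutoff.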
References
1. Dawson MA, Kouzarides T (2012) Cancer epigenetics: from mechanism to therapy. Cell 150:12–27
2. Clozel T, Yang S, Elstrom RL, Tam W, Martin P, Kormaksson M, Banerjee S, Vasanthakumar A, Culjkovic B, Scott DW, Wyman S, Leser M, Shaknovich R, Chadburn A, Tabbo F, Godley LA, Gascoyne RD, Borden KL, Inghirami G, Leonard JP, Melnick A, Cerchietti L (2013) Mechanism-based epigenetic chemosensitization therapy of diffuse large B-cell lymphoma. Cancer Discov 3:1002–1019. https://doi.org/10.1158/2159-8290.CD-13-0117
3. Shaknovich R, Melnick A (2011) Epigenetics and B-cell lymphoma. Curr Opin Hematol 18:293–299. https://doi.org/10.1097/MOH.0b013e32834788cf
4. Pan H, Jiang Y, Boi M, Tabbò F, Redmond D, Nie K, Ladetto M, Chiappella A, Cerchietti L, Shaknovich R, Melnick AM, Inghirami GG, Tam W, Elemento O (2015) Epigenomic evolution in diffuse large B-cell lymphomas. Nat Commun 6:6921
5. Lin P-CC, Giannopoulou EG, Park K, Mosquera JM, Sboner A, Tewari AK, Garraway LA, Beltran H, Rubin MA, Elemento O (2013) Epigenomic alterations in localized and advanced prostate cancer. Neoplasia 15:373–383. https://doi.org/10.1593/neo.122146
6. Pike BL, Greiner TC, Wang X, Weisenburger DD, Hsu Y-H, Renaud G, Wolfsberg TG, Kim M, Weisenberger DJ, Siegmund KD, Ye W, Groshen S, Mehrian-Shai R, Delabie J, Chan WC, Laird PW, Hacia JG (2008) DNA methylation profiles in diffuse large B-cell lymphoma and their relationship to gene expression status. Leukemia 22:1035–1043. https://doi.org/10.1038/leu.2008.18
7. Esteller M (2002) CpG island hypermethylation and tumor suppressor genes: a booming present, a brighter future. Oncogene 21:5427–5440
8. Shaknovich R, Geng H, Johnson NA, Tsikitas L, Cerchietti L, Greally JM, Gascoyne RD, Elemento O, Melnick A (2010) DNA methylation signatures define molecular subtypes of diffuse large B-cell lymphoma. Blood 116:e81–e89
9. Akalin A, Garrett-Bakelman FE, Kormaksson M, Busuttil J, Zhang L, Khrebtukova I, Milne TA, Huang Y, Biswas D, Hess JL, Allis CD, Roeder RG, Valk PJM, Löwenberg B, Delwel R, Fernandez HF, Paietta E, Tallman MS, Schroth GP, Mason CE, Melnick A, Figueroa ME (2012) Base-pair resolution DNA methylation sequencing reveals profoundly divergent epigenetic landscapes in acute myeloid leukemia. PLoS Genet 8:e1002781. https://doi.org/10.1371/journal.pgen.1002781
10. Meissner A, Gnirke A, Bell GW, Ramsahoye B, Lander ES, Jaenisch R (2005) Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 33:5868–5877
11. Sidow A, Spies N (2015) Concepts in solid tumor evolution. Trends Genet 31:208–214
12. Landau DA, Clement K, Ziller MJ, Boyle P, Fan J, Gu H, Stevenson K, Sougnez C, Wang L, Li S, Kotliar D, Zhang W, Ghandi M, Garraway L, Fernandes SM, Livak KJ, Gabriel S, Gnirke A, Lander ES, Brown JR, Neuberg D, Kharchenko PV, Hacohen N, Getz G, Meissner A, Wu CJ (2014) Locally disordered methylation forms the basis of intratumor methylome variation in chronic lymphocytic leukemia. Cancer Cell 26:813–825. https://doi.org/10.1016/j.ccell.2014.10.012
13. Andrews S (2010) FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
14. Krueger F (2012) Trim Galore!. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
15. Krueger F, Andrews SR (2011) Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27:1571–1572
16. Giannopoulou EG, Elemento O (2011) An integrated ChIP-seq analysis platform with customizable workflows. BMC Bioinformatics 12:277
17. Goodarzi H, Elemento O, Tavazoie S (2009) Revealing global regulatory perturbations across human cancers. Mol Cell 36:900–911. https://doi.org/10.1016/j.molcel.2009.11.016
18. R Development Core Team (2011) R Foundation for Statistical Computing, Vienna AI 3-900051-07-0. R A Lang Environ Stat Comput 55:275–286
19. Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan MT, Carey VJ (2013) Software for computing and annotating genomic ranges. PLoS Comput Biol 9:e1003118
20. Xi Y, Li W (2009) BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 10:1–9
21. Smith AD, Chung WY, Hodges E, Kendall J, Hannon G, Hicks J, Xuan Z, Zhang MQ (2009) Updates to the RMAP short-read mapping software. Bioinformatics 25:2841–2842
22. Li H, Ruan J, Durbin R (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18:1851–1858
23. Chen P-Y, Cokus SJ, Pellegrini M (2010) BS Seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 11:203
24. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D (2002) The Human Genome Browser at UCSC. Genome Res 12:996–1006. https://doi.org/10.1101/gr.229102
25. Aken BL, Ayling S, Barrell D, Clarke L, Curwen V, Fairley S, Fernandez-Banet J, Billis K, Garcia-Giron C, Hourlier T, Howe KL, Kahari AK, Kokocinski F, Martin FJ, Murphy DN, Nag R, Ruffier M, Schuster M, Tang YA, Vogel J-H, White S, Zadissa A, Flicek P, Searle SMJ (2016) The Ensembl gene annotation system. Database (Oxford) 2016:baw093. https://doi.org/10.1093/database/baw093
26. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:1–25. https://doi.org/10.1186/gb-2009-10-3-r25
27. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359
28. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29
29. Shaffer AL, Wright G, Yang L, Powell J, Ngo V, Lamy L, Lam LT, Davis RE, Staudt LM (2006) A library of gene expression signatures to illuminate normal and pathological lymphoid biology. Immunol Rev 210:67–85. https://doi.org/10.1111/j.0105-2896.2006.00373.x
30. Kolde R (2012) Package ‘pheatmap’. Bioconductor:1–6
31. Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M (2016) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res 44:D457–D462
32. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M (1999) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 27:29–34
33. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102:15545–15550
34. Lai AY, Fatemi M, Dhasarathy A, Malone C, Sobol SE, Geigerman C, Jaye DL, Mav D, Shah R, Li L, Wade PA (2010) DNA methylation prevents CTCF-mediated silencing of the oncogene BCL6 in B cell lymphomas. J Exp Med 207:1939–1950
35. Landan G, Cohen NM, Mukamel Z, Bar A, Molchadsky A, Brosh R, Horn-Saban S, Zalcenstein DA, Goldfinger N, Zundelevich A, Gal-Yam EN, Rotter V, Tanay A (2012) Epigenetic polymorphism and the stochastic formation of differentially methylated regions in normal and cancerous tissues. Nat Genet 44:1207–1214. https://doi.org/10.1038/ng.2442
36. Eichten SR, Stuart T, Srivastava A, Lister R, Borevitz JO (2016) DNA methylation profiles of diverse Brachypodium distachyon aligns with underlying genetic diversity. Genome Res 26:1520–1531. https://doi.org/10.1101/gr.205468.116
37. Li S, Garrett-Bakelman F, Perl AE, Luger SM, Zhang C, To BL, Lewis ID, Brown AL, D’Andrea RJ, Ross ME, Levine R, Carroll M, Melnick A, Mason CE (2014) Dynamic evolution of clonal epialleles revealed by methclone. Genome Biol 15:472
Chapter 4
MicroRNA Networks in Breast Cancer Cells

Andliena Tahiri, Miriam R. Aure, and Vessela N. Kristensen
Abstract
A variety of molecular techniques can be used in order to unravel the molecular composition of cells. In
particular, the microarray technology has been used to identify novel biomarkers that may be useful in the
diagnosis, prognosis, or treatment of cancer. The microarray technology is ideal for biomarker discovery as
it allows for the screening of a large number of molecules at once. In this review, we focus on microRNAs
(miRNAs) which are key molecules in cells and regulate gene expression post-transcriptionally. miRNAs are
small, single-stranded RNA molecules that bind to complementary mRNAs. Binding of miRNAs to
mRNAs leads to either degradation or translational inhibition of the target mRNA. Roughly one third
of all the mRNAs are postulated to be regulated by miRNAs. miRNAs are known to be deregulated in
different types of cancer, including breast cancer, and it has been demonstrated that deregulation of several
miRNAs can be used as biological markers in cancer. miRNA expression can for example discriminate
between normal, benign and malignant breast tissue, and between different breast cancer subtypes.
In the post-genomic era, an important task of molecular biology is to understand gene regulation in the
context of biological networks. Because miRNAs have such a pronounced role in cells, it is pivotal to
understand the mechanisms that underlie their control, and to identify how miRNAs influence cancer
development and progression.
Key words Biomarkers, Breast cancer, Cancer, Microarrays, microRNA, Systems biology
1 microRNA Biology
1.1 microRNAs: A Historical Perspective
The central dogma in molecular biology has for a long time been
“DNA makes RNA that makes protein” [1]. However, the impact
of a gene on the phenotype is highly dependent on different
mechanisms that allow a particular gene to be turned “on” or
“off” in a particular state, in a particular cell, at a particular time.
One way this type of regulation can be performed is by small RNA
regulatory units called microRNAs (miRNAs). miRNAs are small,
non-protein-coding RNA molecules that function as negative
regulators of gene expression either by inhibiting translation or
inducing degradation of messenger RNA (mRNA). Lin-4 was the first
miRNA to be discovered, in the nematode Caenorhabditis elegans
(C. elegans), by Lee et al. in 1993 [2]. It took researchers
seven more years to identify another miRNA, let-7, in C. elegans
[3]. Let-7 consisted of only 21 nucleotides and was found to
have a significant role in nematode development. As a result, gene
expression studies were complemented with studies of the novel
molecules regulating gene expression. Let-7 was thereafter
identified in several organisms, including humans, with highly conserved
sequences among different species. Today, more than 2500 mature
human miRNAs have been annotated (miRBase version 21; [4]),
and more than 60% of all the genes are predicted to be regulated by
miRNAs [5].

Andliena Tahiri and Miriam R. Aure contributed equally to this work.

After the initial description of miRNAs in C. elegans in 1993, it
took several years before the role of miRNAs started to be fully
appreciated. Over the last two decades, the number of papers
published on miRNAs has exploded. However, there are still
many unanswered questions regarding the detailed mechanisms
by which miRNAs exert their regulatory roles.
Part of the difficulty in studying miRNA function is due to the
complexity of miRNA biology. One miRNA may target several
genes and one mRNA transcript has putative binding sites for
various miRNAs. Thus, trying to dissect the in vivo connections
between miRNAs and target mRNAs is a complex combinatorial
challenge. Adding to the complexity of validating miRNA-mRNA
relations is the fact that miRNA expression is tissue- and
time-specific, i.e., the context dependence is high.
1.2 miRNA Biogenesis and Function
The process of generating mature miRNAs in the cell consists of a
series of nuclear and cytoplasmic steps (see Fig. 1). miRNAs are
encoded either independently of protein-coding genes (intergenic)
or inside introns of a host gene (intronic). Transcription occurs in
the nucleus by RNA polymerase II and produces a long primary
hairpin transcript called the primary miRNA (pri-miRNA). The
pri-miRNA is long (>1 kb) and contains a local stem–loop struc-
ture, which is cleaved by the microprocessor complex (RNase III
Drosha, in combination with DiGeorge syndrome critical region
gene 8) in order to generate a precursor miRNA (pre-miRNA)
[6, 7]. The pre-miRNA is exported to the cytoplasm by Exportin-
5 and RAN-GTP, where it is further processed by RNase III endo-
nuclease Dicer, to form a double-stranded miRNA duplex (~22 nt)
[8]. The duplex is made up of two mature miRNA strands (named
-5p and -3p depending on the 5′ and 3′ directions of the strand),
and is subsequently loaded onto an Argonaute (AGO) protein to
form an effector complex called the RNA-induced silencing
complex (RISC) [7]. Usually, the RNA strand with the unstable 5′-end
is recruited into RISC, whereas the other strand (-3p) is released
and quickly degraded. However, some studies have shown that the
less abundant strand is also active in silencing, albeit usually less
potently than the more abundant guide strand [19].

Fig. 1 The canonical miRNA biogenesis pathway and miRNA function (see the text for details). Pri-miRNA,
primary miRNA; EXP5, Exportin 5; POL II, RNA polymerase II; pre-miRNA, precursor miRNA; RISC, RNA-induced
silencing complex

Once the mature miRNA strand is incorporated into the RISC complex,
the miRNA sequence targets mRNA through either perfect or
imperfect complementary binding to the 3′ untranslated region
(UTR), the coding region, or the 5′-UTR of genes [9, 10]. Binding
of RISC to target mRNA can have different outcomes. Imperfect
complementary binding of miRNAs to their targets inhibits trans-
lation and reduces protein expression without affecting the mRNA
levels of these genes. Perfect complementary pairing between
miRNA and mRNA targets the mRNA for degradation by RISC
[11]. The exact mechanism of protein reduction is not fully under-
stood, but it is likely that it occurs through both RNA degradation
and translational repression pathways, with different miRNAs
contributing to each pathway in different proportions [12]. mRNA
degradation in mammals involves poly(A)-tail shortening
(deadenylation) and decapping at the 5′-end of the
mRNA strand. It is believed that miRNAs regulate a substantial
portion of all protein coding genes. The complexity of mRNA
regulation through miRNAs is remarkable as each miRNA can
potentially regulate hundreds of genes, and one gene can be
regulated by several miRNAs. Additionally, several transcription

factors have been identified that can directly influence the expres-
sion of miRNAs [13–15]. miRNAs are considered important reg-
ulators of gene expression, involved in cellular processes such as
development, cell proliferation, apoptosis, metabolism, cell differ-
entiation, and stem cell division [16]; processes that are also highly
involved in cancer pathogenesis.
1.3 miRNA–Target Gene Interactions and Predictions
The dominant target recognition sequence in the miRNA is termed
the “seed” sequence and is located at nucleotides 2–8 of the
miRNA, counted from the 5′-end [17]. These positions in the miRNA are
often evolutionarily conserved. Other compensatory rules for
miRNA-mRNA target recognition also exist, but all of them
include some degree of sequence complementarity. miRNA target
prediction is a major task in computational biology. Several in silico
approaches exist that predict targets for a given miRNA and are
described later in more detail. Those are based on different criteria
such as complementarity to the miRNA seed region, evolutionary
conservation of the miRNA recognition elements in the mRNA,
free energy of the miRNA-mRNA hetero-duplex, and mRNA
sequence features outside the target site [18, 19].
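The seed rule described above can be made concrete with a short Python sketch. It checks only for perfect Watson-Crick complementarity to nucleotides 2–8 of the miRNA and deliberately ignores the other criteria listed in the text (conservation, duplex free energy, sequence context), so it is not a substitute for the prediction tools discussed later. The function names and both sequences are invented for illustration and do not represent a real miRNA or transcript.

```python
def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence (A-U and G-C pairing only)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna))


def seed_sites(mirna: str, utr: str) -> list:
    """Return 0-based start positions in `utr` that pair perfectly with
    the miRNA seed (nucleotides 2-8 counted from the 5' end)."""
    seed = mirna[1:8]                # positions 2..8 in 1-based numbering
    site = reverse_complement(seed)  # sequence expected on the mRNA, 5'->3'
    hits = []
    start = utr.find(site)
    while start != -1:
        hits.append(start)
        start = utr.find(site, start + 1)
    return hits


# Invented sequences, both written 5'->3'.
mirna = "AUGGCAUCCGAUUACGGAUCCA"
utr = "CCAAGAUGCCAUUUGGAUGCCAGG"

print(seed_sites(mirna, utr))
```

A real predictor would score each candidate site rather than report exact matches only, which is why the additional criteria in the text matter in practice.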
1.4 miRNA Function on a Cancer Systems Level
miRNAs are able to fine-tune the protein level of thousands of
genes, either directly or indirectly, and thereby make fine-scaled
adjustments to protein output [20]. The variety and abundance of
targets offer an enormous level of combinatorial possibilities. This
high level of complexity suggests that miRNAs and their targets
form an intricate regulatory network intertwined with other cellular
networks. It is pivotal to understand how miRNAs regulate cellular
processes at the systems level, including miRNA regulation of
cellular networks, metabolic processes, protein interactions, and
gene regulatory networks. Studying different networks to assess
the influence of miRNAs on their targets will help to identify
miRNAs that have a strong influence on breast cancer development
and progression.
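One simple way to represent the many-to-many relationship between miRNAs and their targets is a bipartite network stored as two lookup tables. The following Python sketch is an illustration only: the miRNA and gene names are placeholders, the interaction list is invented, and ranking nodes by their number of connections is just one of many possible ways to flag potentially influential miRNAs.

```python
from collections import defaultdict

# Hypothetical miRNA -> target gene pairs (placeholder names, not real data).
interactions = [
    ("miR-A", "GENE1"), ("miR-A", "GENE2"), ("miR-A", "GENE3"),
    ("miR-B", "GENE2"), ("miR-B", "GENE4"),
    ("miR-C", "GENE2"),
]

targets_of = defaultdict(set)     # miRNA -> set of target genes
regulators_of = defaultdict(set)  # gene  -> set of miRNAs targeting it
for mirna, gene in interactions:
    targets_of[mirna].add(gene)
    regulators_of[gene].add(mirna)


def shared_targets(mirna_1: str, mirna_2: str) -> set:
    """Genes targeted by both miRNAs (candidates for combinatorial regulation)."""
    return targets_of.get(mirna_1, set()) & targets_of.get(mirna_2, set())


# Rank miRNAs by number of targets and genes by number of regulating miRNAs.
mirna_by_degree = sorted(targets_of, key=lambda m: (-len(targets_of[m]), m))
gene_by_degree = sorted(regulators_of, key=lambda g: (-len(regulators_of[g]), g))
```

In this toy network one miRNA regulates three genes while one gene is targeted by three miRNAs, mirroring the combinatorial complexity described in the text.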
2 miRNAs in Cancer
2.1 Breast Pathophysiology
Cancer is a complex disease involving abnormal growth of cells,
invasion of surrounding tissue, and migration to and invasion of
distant sites. Breast cancer is the most common type of cancer and
the most common cause of cancer-related deaths in women worldwide [21].
However, the most frequently observed abnormalities in the
breast are usually benign. Benign breast conditions can be divided
into three groups based on how they affect breast cancer risk [22]:
(1) non-proliferative lesions, such as cysts or fibrosis, which carry
almost no breast cancer risk; (2) proliferative lesions without atypia,
such as breast fibroadenoma or fibroadenomatosis, which show excessive
growth of cells in the ducts or lobules of the breast tissue and slightly
increase a woman’s risk of developing breast cancer; and (3) proliferative
lesions with atypia, such as atypical ductal/lobular hyperplasia
(ADH/ALH), which show an overgrowth of cells in the ducts or lobules of
the breast and have a strong effect on breast cancer risk [23–28].
Although there are many types of benign lesions in the breast,
most research is focused on malignant breast tumors. Breast cancer
types can be grouped based on the origin of tumor formation into
invasive or noninvasive types of breast cancer. These include lobular
carcinoma in situ (LCIS), ductal carcinoma in situ (DCIS), invasive
ductal carcinoma (IDC), and invasive lobular carcinoma (ILC).
DCIS and LCIS are considered pre-invasive cancer as they can in
some cases metastasize [29]. IDC is the most common type of
breast cancer which starts in a milk duct of the breast, breaks
through the wall of the duct, and grows into the fatty tissue of
the breast. At this point, it may metastasize to other parts of the
body through the lymphatic system and bloodstream. ILC on the
other hand, starts in the lobules, and like IDC, it can metastasize.
ILCs account for about 5–15% of breast cancer cases, and
they are more difficult to detect than IDCs through physical
examination, mammography, and even gross pathologic
evaluation [30].
Cancer has for a long time been viewed as a genetic disease.
However, as the unraveling of the molecular biology of breast
cancer progresses, it is no longer seen as a single disease, but rather
as a complex disease involving many subtypes with different out-
comes based on differences in the genetic makeup [31, 32]. For
example, in breast cancer the expression of estrogen receptor (ER),
progesterone receptor (PR), and human epidermal growth factor
receptor 2 (HER2)/neu have implications for prognosis and ther-
apy selection that are independent of TNM staging, which
describes tumor size or depth (T), lymph node spread (N), and
presence or absence of metastases (M) [33]. Ki-67 is a prognostic
marker of breast cancer that has recently been applied in the clinic
[34]. Ki-67 is a marker of proliferation, and a higher percentage of
Ki-67 in breast cancers (usually above 15%) indicates a worse
prognosis.
Thanks to advances in molecular biology and the use of
microarray technology, breast cancer can be divided into four subtypes
based on their genetic profiles, each with a different clinical
outcome [31, 32, 35–37]. These subgroups are named: (1) luminal
A (ER+, HER2−), (2) luminal B (ER+, HER2+/HER2−),
(3) HER2-enriched, and (4) basal-like, also often termed triple
negative (ER−, PR−, HER2−). The luminal A subtype has the
best prognosis compared to the other subtypes. Luminal B tumors
are characterized by high proliferation activity (Ki-67 index), may
be positive for HER2 expression, and have a worse prognosis than
luminal A tumors [38]. However, HER2 expression in the luminal
B subtype is lower than in HER2-enriched tumors. The HER2-
enriched subtype is often associated with nodal metastasis, whereas
the basal-like often occurs in younger patients, is more frequently
associated with visceral organ metastasis, and has a very poor prog-
nosis [39]. It is important to note that not all triple negative tumors
are identified as basal-like by gene expression, and not all basal-like
tumors are triple negative [40].
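The receptor patterns listed above can be written down as a small decision function. The sketch below is a simplified, receptor-based surrogate and not the expression-profile classification the subtypes are actually based on; as the text notes, triple negative and basal-like tumors do not fully coincide. The function name, the handling of ER-negative/PR-positive cases, and the Ki-67 cutoff used to separate luminal A from luminal B are assumptions made for illustration, not clinical rules.

```python
from typing import Optional


def surrogate_subtype(er: bool, pr: bool, her2: bool,
                      ki67_percent: Optional[float] = None,
                      ki67_cutoff: float = 15.0) -> str:
    """Assign a receptor-based surrogate subtype label.

    `ki67_cutoff` is an assumed threshold for "high proliferation".
    """
    if er:
        high_proliferation = (ki67_percent is not None
                              and ki67_percent > ki67_cutoff)
        if her2 or high_proliferation:
            return "luminal B"
        return "luminal A"
    if her2:
        return "HER2-enriched"
    if not pr:
        return "triple negative (often basal-like)"
    # ER-negative, PR-positive, HER2-negative is not covered by the text.
    return "unclassified"


print(surrogate_subtype(er=True, pr=True, her2=False, ki67_percent=5.0))
print(surrogate_subtype(er=False, pr=False, her2=False))
```

The explicit "unclassified" branch is a reminder that a receptor table cannot replace subtype assignment by gene expression.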
2.2 Cancer Biomarkers
A biomarker is a biological molecule present in any biological
material (e.g., tissue, cell, or body fluid) that can be used as a
measurable indicator of normal biological processes, pathogenic
processes, or response to therapy [41]. Biomarkers in cancer can
be divided into three main subgroups serving different purposes
in the clinic: (1) risk assessment markers; (2) diagnostic markers;
and (3) prognostic markers (Table 1). Medical scientists still strive
to find the best biomarkers that can provide a reliable diagnosis, tell
us which therapy is best for a particular patient, or, even better,
tell us whether a person is at risk of getting a certain disease without
reaching the diseased stage. Early cancer detection would dramati-
cally reduce mortality associated with the disease; however, early
diagnosis relies on clinically validated biomarkers with high speci-
ficity and sensitivity.
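Sensitivity and specificity, the two properties named above, are simple ratios computed from a validation cohort. The following Python sketch shows the arithmetic; the function names and the counts are invented for illustration and do not describe any real biomarker.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of diseased individuals that the marker correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of healthy individuals that the marker correctly clears."""
    return true_neg / (true_neg + false_pos)


# Invented validation counts: 50 patients with cancer, 200 without.
tp, fn = 45, 5     # cancers detected / missed by the marker
tn, fp = 180, 20   # healthy correctly negative / falsely positive

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In screening settings, where healthy individuals greatly outnumber patients, even a modest loss of specificity produces many false positives, which is why both values need to be high.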
Since 1985, the TNM staging [42] has provided doctors with
the basis for the prediction of survival, choice of treatment, and
stratification of patients. At the same time, it has provided consis-
tency among healthcare providers. In some cases, tumor grade,
histological subtype, or patient age would be added to TNM
staging when such information was important for the prediction
of survival or response to therapy. Today, the findings of new
molecular markers that can predict survival and efficacy of therapy
provide additional important information to TNM staging, and are
used in the clinics worldwide.

Table 1
Different types of biomarkers important in clinical settings of cancer research

Biomarker                 Purpose                                            Example                       References
1. Risk assessment        • Aid in cancer prevention                         Breast and ovarian cancer:    [42]
   and screening          • Provide the earliest evidence of potential       BRCA1/BRCA2
                            cancer in persons not yet diagnosed with the
                            disease
2. Diagnostic             • Establish a diagnosis                            TNM staging                   [43]
                          • Assist with staging, grading, and selection
                            of initial therapy
3. Prognostic and         • Estimate the aggressiveness of a condition       Breast cancer: ER, PR,        [31, 44, 45]
   predictive             • Predict how well a patient will respond to a     HER2, and Ki-67
                            specific treatment                               Melanoma: BRAF (V600E)

In addition to genetic changes, post-transcriptional and
post-translational modifications and metabolic changes play a role in
cancer formation [43–45]. Better appreciation of the complexity of
carcinogenesis has provided us with a number of candidate biomarkers
valuable for risk assessment, screening, diagnosis, prognosis, and
selection and monitoring of therapy. Nonetheless, although several
markers have been identified and are used for diagnostic and prognostic
purposes with implications for therapy, histological examination
is still required for diagnosis, whereas immunohistochemistry
and genetic tests are used for treatment decisions and prognosis
determination.
2.3 miRNA Function in Cancer
The first implication of miRNAs in cancer was observed in B-cell
chronic lymphocytic leukemia (CLL) during the search for tumor
suppressors at chromosome 13q14, which is commonly deleted in CLL
patients [46]. In this study, the authors found that miR-15a and
miR-16-1 were located in this region. Since loss of this chromosomal
region was frequent in CLL, it indicated that loss of these miRNAs
also occurred, raising the question of whether miRNAs could be
involved in the pathogenesis of cancer.
Later, the same group identified several miRNAs located in fre-
quently deleted or amplified regions in the genome in different
tumors [47].
Iorio et al. in 2005 described the first breast cancer miRNA
signature which could discriminate tumors from normal tissues
[48]. Subsequent studies have increased our understanding of
miRNA involvement in breast cancer, and identified aberrant
miRNA expression related to survival, metastasis, stage, prolifera-
tion, molecular subtype, TP53 mutational status, hormone recep-
tor status, and response to treatment [49–53]. The studies revealed
that changes in miRNA expression profiles can serve as phenotypic
signatures of specific types of cancer. Aberrant miRNA expression
associated with tumorigenesis can be a result of various mechan-
isms. Several studies point to transcriptional deregulation, copy
number aberrations, mutations, epigenetic alterations, and defects
in the miRNA biogenesis machinery as contributors to miRNA
deregulation in cancer [54]. Some miRNAs may be causally linked
to tumorigenesis by directly modifying tumor-suppressor or onco-
genic pathways. For example, the overexpression of miRNAs can
inhibit tumor-suppressor genes in a pathway. Conversely, reduced
miRNA expression through loss-of-function mutations could result
in increased expression of oncogenes, also contributing to cancer
development and progression (see Fig. 2).
Fig. 2 miRNAs may have oncogenic or tumor-suppressive roles in cancer. Upregulation of oncogenic miRNAs
results in increased repression of tumor-suppressor target genes. Conversely, downregulation of
tumor-suppressor miRNAs results in decreased repression and thus increased expression of target oncogenes. Both
scenarios may lead to cancer development and progression. Figure based on Lujambio and Lowe [119]
2.4 miRNAs as Cancer Biomarkers
Various studies provide evidence that miRNAs can be used as
biomarkers for different purposes [55, 56]. Deregulated expression
profiles of miRNAs have been discovered in a wide variety of human
cancers, including breast cancer [57], colorectal cancer [58], gli-
oma [59], lymphoma [60], and prostate cancer [61].
The survival and prognosis of a patient are highly dependent on
the stage of the tumor at the time of detection. The earlier a tumor
is detected, the better the prognosis. Thus, a major clinical
challenge in cancer is the identification of biomarkers that can
detect cancer at an early stage. miRNAs can be reliably extracted
and detected from frozen and paraffin-embedded tissues. They can
moreover be found circulating freely in the blood or bound to
circulating exosomes, and in different body fluids like urine, saliva,
and sputum [62]. The fact that miRNAs are stable in body fluids,
and that they are easily detectable through noninvasive procedures
makes miRNAs attractive biomarker candidates. For example,
miRNA signatures in plasma had strong diagnostic and prognostic
potential detecting lung cancer before disease onset, as plasma
samples were collected 1–2 years before lung cancer was detected
by CT [63]. Another recent study by Cava et al. [64] showed that
miRNA profiling improved breast cancer classification and could
differentiate patients with breast cancer as responding or not
responding to therapy, with promising results. The correct classifi-
cation of breast cancer is a fundamental factor in determining the
appropriate treatment, and it is now evident that miRNAs have the
potential to provide new diagnostic, prognostic, and predictive
biomarkers for cancer, with a potentially great impact in the clinic.
However, they have not yet been implemented in clinical practice, as
many hurdles remain to be overcome.
3 Techniques for Studying miRNA Networks in Cancer
3.1 The Microarray Principle
The oligonucleotide microarray is a high-throughput technique based
on hybridizing labeled sample material to complementary probes
that are immobilized on a solid surface. The amount of material
that has hybridized to the probes is quantified by a laser that scans
the array and excites the fluorescent dye attached to the labeled
sample. One array contains thousands of probes, each representing
a defined sequence that is complementary to an mRNA or miRNA
transcript. The microarray technology is a useful tool to study the
expression of thousands of miRNAs or mRNAs simultaneously.
Many different platforms exist, with varying probe content and
length and different labeling techniques. Figure 3 illustrates the
steps of Agilent-based miRNA/mRNA expression profiling.
Fig. 3 miRNA and mRNA expression profiling using Agilent microarrays. RNA is labeled with a fluorescent dye
(Cyanine 3; Cy) and transferred to the microarray where the sample material hybridizes to complementary
probes during incubation. Then follows washing and scanning of the array, and finally feature extraction where
probe hybridization intensities are quantified. The protocol deviates slightly between microRNA and mRNA
analysis. For the former, RNA is treated with phosphatase to remove the 3′-phosphate group (P), which is
followed by labeling. For mRNA profiling, the RNA is first converted to complementary DNA (cDNA) by reverse
transcriptase (RT), and then the cDNA is further transcribed into complementary RNA (cRNA) by the use of RNA
polymerase (POL II) where labeled cytosine residues are incorporated
3.2 Functional Experiments to Validate miRNA Targets and Their Effect on Cells
Data from functional studies of miRNAs in cell lines can be
generated after identifying interesting candidates from analyses of
high-throughput data. The aim is to determine whether the candidate
miRNA is functionally involved in cancer-associated processes. This
can be done by testing the effects of silencing or overexpression of
the candidate miRNA on the viability and proliferation of cancer
cells. Knockdown of potential tumor driver miRNAs can be per-
formed using small, single-stranded anti-miRs which are miRNA
inhibitors that bind to and inhibit endogenous miRNAs [65]. Con-
versely, the effect of candidate tumor-suppressor miRNAs can be
assessed by overexpression, for example by adding miRNA mimics
and measuring the effect on cell viability. In order to effectively
study the functional role of miRNAs in cell lines, high-throughput
screens can be performed. Leivonen et al. used libraries of either
miRNA mimics or anti-miRNAs which were tested simultaneously
in large scale and used to measure the effect of miRNA overexpres-
sion or knockdown, respectively [66]. miRNAs can be spotted in
96- or 384-well formats, and incubated with cells from a cell line of
interest. The phenotypic end-points of such screens may measure
the effects that miRNAs have on cell viability, apoptosis, and prolif-
eration, as well as expression of marker proteins. Leivonen et al.
[66] performed a high-throughput screen to identify miRNAs that
were important for the growth of HER2-positive breast cancer
cells. They overexpressed miRNAs in HER2-positive cell lines and
assessed the effect on HER2 protein levels, proliferation (Ki67),
and apoptosis (cleaved PARP). Thirty-eight miRNAs were identi-
fied that inhibited HER2 signaling and cell growth. In another
study [53], miRNAs that were identified as differentially expressed
between high and low proliferative tumor samples (scored by
immunohistochemistry) were further functionally validated by
transfecting a library of pre-miR constructs into breast cancer cell
lines. The cells were lysed and the lysates printed on slides that were
then stained with an antibody against Ki67 to assess the effect of the
miRNAs on proliferation. Among the 123 identified differentially
expressed miRNAs, 13 showed a corresponding functional effect
on Ki67 protein levels [53].
The measurement of ATP using luciferase is one of the most
commonly used assays for assessing cell viability in high-
throughput screening applications [67]. The assay is fast and easy
to use, sensitive, and also less prone to artifacts than other viability
assay methods [67, 68]. However, the assay measures metabolically
active cells, which cannot be translated into viable cells in all con-
texts. Another method that is widely used is the MTT Tetrazolium
Reduction Assay. Yet, the MTT assay lacks sensitivity, is more time-
consuming and more prone to variation, due to multiple experi-
mental steps involved compared to the ATP assay [68]. Other
methods that are used to measure the effect of miRNAs on cell
viability or proliferation include the TUNEL assay, Trypan Blue
staining assay, Tetrazolium Reduction Assays, etc. The choice of

method relies on the investigators’ preferences, and there are both
benefits and pitfalls for each assay which have to be taken into
consideration.
3.3 Databases and Tools
Different databases exist that list miRNAs, their chromosomal
location, sequence, and their putative target genes. For example,
the miRBase database contains all published miRNA sequences and
annotations [4]. The Ingenuity Pathway Analysis database (IPA,
Ingenuity Systems; www.ingenuity.com) can be used to associate
genes correlated to candidate miRNAs with pathways and for vari-
ous gene annotation purposes. The SEEK tool [69] can be used to
identify and annotate genes that are co-expressed with miRNA-
correlated genes. Different computational tools are readily available
for the analysis of miRNA target sites, such as miRanda [70–72],
TargetScan [5, 73, 74], PicTar [75–77], and DianaMicroT-CDS
[78, 79] (Table 2). Those can be used to predict potential targets of
a miRNA that has been identified (for example in cancer tissue), or
vice versa, identify candidate miRNAs predicted to bind to a gene of
interest.
Feedback from functional validation results has greatly
improved the performance of these in silico miRNA target predic-
tion algorithms. The miRanda software was initially designed to
predict miRNA target genes in Drosophila melanogaster
[70, 71]. The algorithm searches the 3′ UTRs for stretches of high
complementarity to identify potential binding sites [70]. A higher
score is given to sequences that are complementary to the 5′ end
of the miRNA than to the 3′ end, leading to higher prediction
scores for seed regions with a perfect, or nearly perfect, match.
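The idea of weighting 5′ complementarity more heavily than 3′ complementarity can be illustrated with a minimal sketch. It is not the miRanda algorithm: the weights, the ungapped alignment, the neglect of G:U wobble pairs, and the names `weighted_score` and `with_mismatch` are all assumptions made for illustration.

```python
# Toy position-weighted complementarity score. Illustrative only: this is
# NOT the miRanda scoring matrix or its alignment procedure.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def weighted_score(mirna, site, seed_weight=2.0, seed_end=8):
    """Score an ungapped miRNA:site duplex.

    mirna is written 5'->3'; site is written 3'->5' so that site[i]
    faces mirna[i]. Pairs within the first `seed_end` miRNA positions
    (the 5' end) count `seed_weight`; pairs further 3' count 1.
    """
    score = 0.0
    for i, (m, s) in enumerate(zip(mirna, site)):
        if (m, s) in PAIRS:
            score += seed_weight if i < seed_end else 1.0
    return score


def with_mismatch(mirna, site, i):
    """Copy of site with a non-pairing base opposite miRNA position i."""
    return site[:i] + mirna[i] + site[i + 1:]


mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # example sequence, 22 nt
perfect_site = "".join(COMPLEMENT[b] for b in mirna)

full = weighted_score(mirna, perfect_site)
seed_mismatch = weighted_score(mirna, with_mismatch(mirna, perfect_site, 3))
tail_mismatch = weighted_score(mirna, with_mismatch(mirna, perfect_site, 15))
```

Under these assumed weights, a single mismatch in the seed lowers the score more than a single mismatch near the 3′ end, which is the behavior described in the text.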
TargetScan is an algorithm developed by Lewis et al. [74], and
was the first miRNA target prediction tool for the human genome,
using a different search approach than miRanda. TargetScan
searches for perfect complementarity in the seed region and beyond
[74]. If there is complementarity outside the seed region, it will
filter out the false positives more efficiently prior to prediction.
Data from conservation analysis derived from orthologous 3′
UTRs are used as input early in the process. Also, thermodynamic
stability is tested to filter predicted target sites [80].

Table 2
Computational algorithms for miRNA target prediction

Algorithm         Website                       References
TargetScan        www.targetscan.org            [5, 73, 74]
miRanda           www.microrna.org              [70–72]
PicTar            pictar.mdc-berlin.de          [75–77]
DianaMicroT-CDS   www.microrna.gr/microT-CDS    [78, 79]
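A search for perfect seed complementarity, as described above for TargetScan, can be sketched in a few lines. This is a simplified illustration, not the TargetScan implementation: it assumes the seed is miRNA nucleotides 2 to 8, ignores conservation and flanking context, and uses a made-up 3′ UTR; `seed_sites` is a hypothetical helper name.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_sites(mirna, utr, start=1, end=8):
    """Return 0-based UTR positions of perfect matches to the miRNA seed.

    mirna and utr are RNA strings written 5'->3'. The seed is
    mirna[start:end], i.e. nucleotides 2-8 with the default arguments
    (an assumption of this sketch).
    """
    seed = mirna[start:end]
    # The target site is the reverse complement of the seed.
    target = "".join(COMPLEMENT[b] for b in reversed(seed))
    hits = []
    pos = utr.find(target)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(target, pos + 1)
    return hits


mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # example sequence
site = "CUACCUC"                  # reverse complement of nucleotides 2-8
utr = "AAA" + site + "AGGG" + site + "UU"  # made-up 3' UTR with two sites

hits = seed_sites(mirna, utr)
```

Real prediction tools add further filters on top of such a scan, for example the conservation and thermodynamic stability criteria mentioned in the text.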
PicTar is the first algorithm for analyzing miRNAs and target
mRNAs in co-expression at a specific time and place. The PicTar
software fully relies on data from several species to identify common
targets for miRNAs [75]. It uses conservation data from 3′ UTRs as
input and searches for alignment of complementary seed regions.
Binding sites are tested for thermodynamic stability and each result
is given a score [75, 80].
The DianaMicroT algorithm scans for larger complementarity
regions and focuses on coding regions of target mRNAs [79]. It
also calculates and uses the free energy of binding sites as an input
for the prediction of targets. Importantly, many miRNAs share
sequence composition and are thus grouped into families based
on sequence homology. Members of the same miRNA family are
believed to at least partly be able to target some of the same genes
due to this sequence similarity [71]. There is still a gap between in
silico predictions and knowledge of the in vivo relations, but further
advances in molecular technology will reduce this gap.
3.4 (Epi-)Genome–Transcriptome Analysis
DNA aberrations are a hallmark of cancer genomes [81], and the
phenotypic effects of such alterations are commonly investigated
through the integration of genomic and transcriptomic data.
Analyzing changes in DNA copy number can be used to identify
aberrant cancer genes. The correlation between copy number and
mRNA expression can be utilized to single out genes for which
DNA aberration is manifested in the altered expression of the gene.
In a similar manner, DNA copy number and methylation status can
be used together with miRNA expression to identify miRNAs
altered on the (epi-)genomic level with effects on the transcrip-
tomic level. The rationale behind such integrative approaches is
that recurrent alterations across tumor samples may indicate func-
tionality through the effect on the transcription levels of the
corresponding miRNAs or genes. Thus, RNA expression is used
as an additional layer to the genomic or epigenetic data to further
identify potential candidate genes. If a change in DNA copy num-
ber affects the expression of a miRNA, the miRNA is more likely to
be under selection in the tumor and hence might be important for
tumorigenesis.
Studies integrating DNA copy number and mRNA expression
in breast cancer have revealed a clear dosage effect of gene copy
number on gene expression [82, 83], which also holds true for
miRNA expression [84]. Lahti et al. [85] divided implementations
for the integrative analysis of DNA copy number and expression
into four main categories: two-step approaches,
correlation-based approaches, regression-based approaches, and
latent variable models. In a two-step approach, tumor samples
and miRNAs/genes are first grouped based on altered copy
number and/or methylation levels, and then, in the second step,
differential expression is assessed between the different groups.
Correlation- and regression-based methods can also be used; requiring
an association with expression points to a potential functional
implication of the alteration for the expression of the
altered miRNA/gene. For example, Aure et al. [84] investigated
the effect of DNA copy number and methylation alterations on
miRNA expression in breast cancer. First, each miRNA in each
patient was assigned to one of the two groups altered or
non-altered based on copy number or methylation status. Then,
Wilcoxon rank-sum tests were used to assess if the expression of a
given miRNA was different in the two groups considering altera-
tions on the copy number, the methylation level, or both. Using
this approach the authors identified miRNAs whose expression was
increased due to gain and/or hypomethylation. The authors fur-
ther identified miRNAs, whose expression was reduced due to loss
or hypermethylation of the miRNA gene. In this way, the study
provided evidence of the mechanisms behind miRNA dysregulation
in breast cancer. Interestingly, it was found that miRNAs from the
same family (i.e., miRNAs that share a seed sequence and are predicted
to regulate the same target genes) were altered by different mechanisms in
different patients, but with the same net effect on miRNA expres-
sion (increased or decreased), emphasizing alteration of miRNA
expression in breast cancer through variable genomic changes.
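The two-step approach described above can be sketched as follows. This is a minimal illustration with made-up values, not the analysis of Aure et al.: the gain threshold of 0.3 on a log2 copy-number ratio is an assumption, and the rank-sum test uses the normal approximation without tie correction so that the example needs only the standard library.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Ties receive average ranks; no tie correction is
    applied to the variance, which is adequate for this illustration.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Made-up data for one miRNA across ten patients.
copy_number = [0.05, 0.62, -0.10, 0.71, 0.02, 0.55, 0.80, -0.04, 0.66, 0.01]
expression = [4.1, 6.3, 3.9, 6.8, 4.4, 6.0, 7.1, 4.0, 6.5, 4.2]
GAIN_THRESHOLD = 0.3  # hypothetical cut-off on the log2 ratio

# Step 1: group patients by alteration status.
gained = [e for c, e in zip(copy_number, expression) if c > GAIN_THRESHOLD]
not_gained = [e for c, e in zip(copy_number, expression) if c <= GAIN_THRESHOLD]

# Step 2: test for differential expression between the groups.
z, p = rank_sum_test(gained, not_gained)
```

A positive z with a small p would be read as higher expression in the gained group. In a real study the test would be repeated for every miRNA and followed by multiple-testing correction.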
Comparative studies of methods integrating copy number and
expression data have shown that the different methods vary in
sensitivity and specificity, as well as in their performance in small
and large sample sizes [85, 86]. The objective of a study, e.g.,
sub-classification of tumor types or the identification of prognostic
or therapeutic targets, should decide which approach should be
used, together with the end-point chosen, e.g., altered genes, gene-
sets, pathways, or genomic regions [87]. However, important can-
cer genes or miRNAs may be overlooked by such integrative
approaches that require variation across samples and which focus
on simultaneous changes in both, e.g., copy number and expression
[85]. For example, despite an observed increase in the expression of
an oncogene, the in-cis correlation may be low if the increased
expression is caused by a mixture of amplification, mutation, or
hypomethylation across the patients.
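The dilution of the in-cis correlation can be made concrete with a small numerical sketch. The values are invented for illustration: in the first scenario all overexpression is driven by amplification, while in the second, three additional patients overexpress the gene through other mechanisms (for example mutation or hypomethylation) without a copy-number change.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Copy-number state per patient (0 = neutral, 1 = amplified); made-up data.
copy_number = [0, 0, 0, 0, 0, 0, 1, 1, 1]

# Scenario 1: only the amplified patients overexpress the gene.
expr_amplification_only = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 3.0, 3.2, 2.9]

# Scenario 2: patients 4-6 also overexpress the gene, but through a
# mechanism other than amplification.
expr_mixed_mechanisms = [1.0, 1.2, 0.9, 3.1, 2.8, 3.0, 3.0, 3.2, 2.9]

r_single = pearson(copy_number, expr_amplification_only)
r_mixed = pearson(copy_number, expr_mixed_mechanisms)
```

In this toy example the gene is overexpressed in six of nine patients in the second scenario, yet its correlation with copy number is markedly lower than in the first, so a purely correlation-based filter could miss it.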
3.5 Integration of Multi-dimensional Data
The development of high-throughput technologies has made it possible
to simultaneously profile the genome, epigenome, transcriptome,
and proteome of biological samples such as breast tumor
tissue taken from biopsies. These so-called multi-dimensional data
represent several molecular levels that together can be used to characterize
biological systems [88]. Uncovering the relations between
the biological components of these systems—DNA copy number,
methylation state, genes, mRNAs, miRNAs, and proteins—allows
approaching breast cancer at a system level. Integration of multi-
dimensional data from various molecular levels is required to reveal

the underlying system in greater detail. Bioinformatic approaches
address these challenges by representing the system as biological
networks and pathways [89]. The ultimate goal of taking such a
systems biology approach is to go from cancer genomes with all
their aberrations to cancer models where these aberrations may be
put in a system in order to identify common denominators, and
ultimately provide mechanistic insight into the development and
progression of cancer [90].
Due to the inherent complexity of cancer biology, a further
rationale for an integrative approach is that by combining data from
different levels and across patients, one may find cancer-relevant
events that might not have been found if only single layers were
assessed. For example, if expression of a gene is increased due to
gain, activating mutations, promoter hypomethylation, or altered
miRNA expression across patients, this would indicate that the
gene is a candidate oncogene, even though each alteration itself
may be infrequent [91]. Approaching breast cancer at the systems
biology level through top-down integration of multi-level
biological data is facilitated by having an outline on how to com-
bine the available data and tools before the analysis starts, and
includes several steps (see Fig. 4). Studies in which several “omics”
levels were integrated to examine the aberrations that occur in
breast cancer were previously performed [92, 93]. In the study by
Curtis et al. [93] a new integrative classification system of breast
tumors was identified based on both genomic and transcriptomic
data [43]. The authors performed an integrated analysis of DNA
copy number and gene expression, and identified novel subgroups
with distinct clinical outcomes, named iClusts 1–10. These sub-
groups include one high-risk ER-positive 11q13/14 cis-acting
subgroup and a favorable prognosis subgroup devoid of copy number
alterations (CNA). In another study, Dvinge et al. [92]
performed a systems-level analysis of miRNA expression in
breast tumors by integrating it
with matched mRNA expression and CNA data. The authors revealed
that at the whole-genome level, miRNAs behave more as fine-
tuners/modulators of gene expression. This modulatory role of
miRNAs was especially evident in CNA-devoid breast tumors, in
which the immune response is prominent.
Molecular data are typically generated on different scales and
units and must be processed prior to integration. Then, individual
relations and interactions must be identified in the data, and finally
put into the context of a larger system where alterations at the
global scale may be identified from the more local findings. This
is typically achieved by assessing the alterations found in the framework
of biological pathways or networks. To fully complement a
systems biology approach, findings from integration of high-throughput
data must be combined with functional experiments to
further evaluate and validate the findings [94].

Fig. 4 Integration and analysis of multi-dimensional data. Biological components are measured across
individuals and platforms, and their relations and interactions are identified. From this, complete networks
and pathways are overlaid or built, and the emerging system is interrogated for alterations. Figure based on
McDermott et al. [89]. Panels: Components (individual measurements on various levels: DNA copy number
aberrations, DNA methylation, gene, mRNA, miRNA, protein); Relations and interactions (individual relations:
transcription level, co-expression, transcription regulation, translation regulation, protein-protein
interaction); Complex systems (networks and pathways)
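Because the molecular layers are measured on different scales and units, one common preprocessing step before integration is to standardize each feature across samples. The sketch below uses z-scores on made-up values; the choice of z-scoring and the layer names are assumptions for illustration, and other normalizations may be more appropriate for a given platform.

```python
import statistics


def zscore(values):
    """Standardize one feature across samples to mean 0 and SD 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# One feature per layer, measured in the same five tumors (made-up values).
layers = {
    "copy_number_log2_ratio": [0.10, 0.85, -0.40, 0.05, 0.60],
    "methylation_beta_value": [0.82, 0.15, 0.77, 0.64, 0.21],
    "mrna_log2_intensity": [7.9, 11.2, 6.8, 8.1, 10.4],
}

standardized = {name: zscore(values) for name, values in layers.items()}
```

After standardization, values from the different layers are expressed in standard deviations from their own mean and can be compared or combined on a common scale.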
Several large-scale projects have been launched that molecularly
profile human tumors at multiple levels with the aim of integrating
various data types to reveal molecular mechanisms of cancer. Some
examples are The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/)
[95], METABRIC (Molecular Taxonomy
of Breast Cancer International Consortium) [93], and the
International Cancer Genome Consortium (ICGC) (http://icgc.org/).
These studies have provided a comprehensive picture of the
great genetic diversity of breast cancers. They have moreover
increased the resolution of classification suggesting the presence
of additional molecular subgroups.
The most comprehensive molecular profiling of human breast
tumors published to date has been done by TCGA [95]. By inte-
grating DNA copy number, methylation data, somatic mutations,
exome sequencing, mRNA arrays, miRNA sequencing, and
reverse-phase protein arrays, the consortium identified four major
groups of breast cancer types. To a large extent the groups recapit-
ulate the molecular subtypes [95]. Using the expression of the 25%
most variable miRNAs, the TCGA study identified seven miRNA
subtypes by consensus non-negative matrix factorization clustering.
These miRNA subtypes correlated with the mRNA subtypes, ER,
PR and HER2 clinical status, but not with mutation status. The
TCGA study further confirmed that breast cancer is a heterogeneous
disease; however, it suggested that most of the heterogeneity
is found within, and not across, the major subtypes.
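Selecting the most variable features before clustering, as in the miRNA subtype analysis above, can be sketched as follows. The example uses sample variance as the variability measure and invented expression values; both are assumptions, and the clustering step itself (consensus non-negative matrix factorization in the TCGA study) is not shown.

```python
import statistics


def most_variable(expression, fraction=0.25):
    """Return the names of the top `fraction` of features by variance."""
    variances = {name: statistics.variance(v) for name, v in expression.items()}
    n_keep = max(1, int(len(expression) * fraction))
    return sorted(variances, key=variances.get, reverse=True)[:n_keep]


# Hypothetical miRNAs measured in four samples (made-up values).
expression = {
    "mir_A": [5.0, 5.1, 4.9, 5.0],
    "mir_B": [2.0, 8.0, 1.0, 9.0],
    "mir_C": [3.0, 3.2, 3.1, 2.9],
    "mir_D": [1.0, 4.0, 2.0, 5.0],
    "mir_E": [7.0, 7.0, 7.1, 6.9],
    "mir_F": [0.0, 6.0, 0.5, 6.5],
    "mir_G": [4.0, 4.5, 4.2, 4.1],
    "mir_H": [6.0, 6.3, 5.8, 6.1],
}

selected = most_variable(expression, fraction=0.25)
```

With eight features and a fraction of 25%, two miRNAs are retained. Filtering in this way removes features that carry little information for separating samples into groups.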
Integrative studies aid in deciphering new subgroups and iden-
tify alterations seen across levels and patients that may ultimately
lead to interruptions at the pathway or network level. In practice,
the analyses and interpretation of multi-level high-throughput
information remain a daunting task. Challenges include data
handling, normalization and standardization, database annotation,
availability of patient clinical information, and dissection of intrinsic
tumor heterogeneity [90, 91]. To be able to exploit these large
amounts of data, further development of computational tools and
improved infrastructure is needed. From these integrative analyses,
new hypotheses can be generated that require experimental testing
and validation [91]. Succeeding in integrating multi-level data
holds the promise of a comprehensive understanding of the altera-
tions that are responsible for tumor initiation, maintenance, and
progression. Such findings may be further translated into improved
strategies for tumor sub-classification, early detection, more accu-
rate prognostication, and a tailored therapy regime, in addition to
revealing new targets for therapy.
4 miRNA Regulation in Breast Cancer
4.1 Methods to Study miRNA Regulation and Target Validation
miRNAs play an important role in the post-transcriptional regulation
of gene expression. To date, the number of experimentally
validated targets is low compared to the hundreds of putative
targets predicted by the different in silico prediction algorithms
[96]. The most common methods for the validation of miRNA
targets include the transfection of reporter vector constructs or
mimic miRNAs into cells, or the use of miRNA inhibitors. Those
are followed by assessing the effects on mRNA (by, e.g., qRT-PCR,
microarrays or sequencing) or protein levels (by, e.g., western blot)
of the putative miRNA targets. The challenge entailed in these
techniques lies in distinguishing direct from indirect effects
[96]. Alternatively, direct methods for the validation of miRNA
targets are based on the immunoprecipitation of the RISC complex
together with the bound miRNA-mRNA complex. RNA isolated
by crosslinking immunoprecipitation can then be
analyzed by high-throughput sequencing (HITS-CLIP) [97]. Yet, also
Co-IP protocols commonly suffer from unspecific binding or co-isolation
of secondary binders.
Most analyses of miRNA crosslinking to date have not included
protein data. Indeed, the majority of studies modeling the regu-
latory impact of miRNAs have been performed on joint miRNA-
mRNA expression data. While the physical interaction takes place
between miRNA and mRNA, in order to validate a true miRNA-
mRNA relation, an effect on the protein level is the ultimate proof

as it gives the final phenotype of miRNA regulation. Depending on
the mechanism of miRNA regulation one may anticipate different
outcomes. For example, negative correlation between miRNA and
mRNA may be expected if the miRNA regulation leads to degrada-
tion of the mRNA. However, if translational inhibition is the
mechanism of action, such negative correlation between miRNA
and mRNA may not be observed. miRNA, mRNA, and protein
expression data have been integrated in order to study potentially
direct and indirect effects of miRNA on protein expression [98]. In
a study by Aure et al., protein expression was modeled as a function
of miRNA and mRNA expression. The model considered both the
effect of one miRNA at a time, and also all miRNAs combined. The
resulting comprehensive “interactome” map of miRNAs in breast
cancer revealed extensive coordination between miRNA and pro-
tein expression with groups of miRNAs coordinately interacting
with groups of proteins, thus suggesting “block interactions”
[98]. In order to suggest possible direct regulatory interactions
between miRNAs and mRNAs, the use of intersected target predic-
tion outputs aided in proposing candidates that should be further
functionally assessed by biochemical experiments.
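Modeling protein expression as a function of mRNA and miRNA expression can be illustrated with an ordinary least-squares fit of protein = b0 + b1·mRNA + b2·miRNA for a single miRNA. The linear form, the fitting method, and the simulated values are assumptions of this sketch; the exact model specification used in the cited study is not reproduced here.

```python
def solve(a, b):
    """Solve a small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_protein_model(mrna, mirna, protein):
    """Least-squares fit of protein = b0 + b1*mrna + b2*mirna."""
    design = [[1.0, g, m] for g, m in zip(mrna, mirna)]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * p for r, p in zip(design, protein)) for i in range(3)]
    return solve(xtx, xty)


# Simulated, noise-free data for one gene/miRNA pair in six tumors.
mrna = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mirna = [2.0, 1.0, 4.0, 3.0, 6.0, 4.0]
protein = [1.0 + 0.8 * g - 0.5 * m for g, m in zip(mrna, mirna)]

b0, b1, b2 = fit_protein_model(mrna, mirna, protein)
```

In such a model, a negative miRNA coefficient after accounting for mRNA level would be consistent with an effect on the protein that is not explained by mRNA abundance alone, such as translational inhibition, although it cannot by itself distinguish direct from indirect regulation.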
4.2 Dissecting the Functional Role of miRNAs in Breast Cancer
Altered miRNA expression in cancer has been extensively reported;
however, there are still many unanswered questions regarding the
role of miRNAs in cancer. miRNAs that are differentially
expressed between samples of different molecular subtypes, TP53
mutation status, and ER status have been described in breast cancer
[53]. The causes of miRNA deregulation in breast cancer have been
investigated by trying to comprehensively study the effect of DNA
methylation and copy number aberrations of miRNA loci and
couple those to miRNA expression [84]. Identifying the various
mechanisms underlying perturbation of miRNA levels will help us
to understand more about the role of miRNAs in tumor develop-
ment and also about miRNA biology in general.
Dissecting the functional role of miRNAs is a challenging task due
to several aspects. miRNA families have likely arisen due to gene
duplication events [99], and members of the same miRNA family
have a high degree of similarity in sequence. In some cases, members
of a miRNA family are also encoded in the same polycistronic tran-
script [100]. Sequence similarities suggest that they may target the
same genes and thus have potentially overlapping functions. From an
evolutionary perspective, the mRNA 3′-UTR, where miRNA targeting
most often occurs, is not constrained by coding needs and thus
has the potential to be subject to selection so that beneficial miRNA-
mRNA target interactions may evolve [101]. Moreover, miRNAs
originating from the same polycistronic transcript or encoded in
close proximity have a high chance of being co-expressed. Hence,
untangling the role of individual miRNAs is complicated.
Several target prediction algorithms have been published [80],

as previously described. However, the false-positive rate of those
predictions has generally been high [102], and the degree of over-
lap between the algorithms differs. Each miRNA is predicted to
target hundreds of genes, including many with various functions.
Thus, as there is a lack of high-throughput methods to validate
miRNA-mRNA interactions, it can be a challenge to prioritize on
which target genes to focus [96]. miRNA expression has also been
shown to be both time- and context-dependent [103, 104] which
can potentially reduce the transferability of validated relations from,
e.g., one cell type to another. Recent studies also suggest that it is in
the very nature of miRNAs to confer only subtle effects on a
target’s protein level (typically less than 50%) rather than conferring
a total abolishment of protein expression [20, 105]. Furthermore,
the effect of miRNA-mediated regulation on mRNA can generally
result from two different mechanisms, either translational inhibi-
tion or mRNA degradation, which subsequently will confer differ-
ent results when trying to model miRNA-mRNA interactions from
high-throughput data. All these considerations should be kept in
mind when studying the role of miRNAs in cancer.
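The varying overlap between prediction algorithms, and the use of intersected predictions to prioritize targets, can be sketched with simple set operations. The gene sets below are placeholders rather than real predictions, and the Jaccard index is just one possible overlap measure.

```python
from itertools import combinations

# Placeholder predicted target sets for one miRNA (not real predictions).
predictions = {
    "TargetScan": {"GENE1", "GENE2", "GENE3", "GENE4"},
    "miRanda": {"GENE2", "GENE3", "GENE5"},
    "PicTar": {"GENE2", "GENE3", "GENE4", "GENE6"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


# Pairwise overlap between algorithms.
overlap = {
    (x, y): jaccard(predictions[x], predictions[y])
    for x, y in combinations(predictions, 2)
}

# Targets predicted by every algorithm: candidates to prioritize.
consensus = set.intersection(*predictions.values())
```

Intersecting predictions typically reduces the number of false positives at the cost of sensitivity, so the consensus set is a starting point for validation rather than a complete target list.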
As more miRNA properties emerge, the key to understanding
the biological role of miRNAs may slowly be revealed as new
hypotheses to be tested develop from the pieces added to the
puzzle. miRNAs have many putative target genes, are expressed in
a context-dependent manner, and function as cell fate switches or
buffers. Integrating those properties with a view of cancer as a
disease, which is driven by perturbations at the signaling network
level [106], suggests that miRNAs may function as effective nodes
in protein signaling networks [101, 107]. Signaling cascades, which
transfer extracellular signals into cellular responses, depend on
dynamic and transient action, and often involve complex feed-
back and feed-forward loops. Proteins are key elements in signaling
networks, but miRNAs with their potential to regulate multiple
targets simultaneously could play a very efficient and timely role in
the tuning of signaling networks. Using mass spectrometry to
investigate the effect of miRNA regulation on proteins indicated
that miRNAs, for most interactions, act to make fine-scale adjust-
ments to protein expression levels [20]. However, with a fine-
tuning role in a normal cellular state, aberrant miRNA expression
may represent acquired signaling capabilities [106]. Aberrant
miRNA expression can have a substantial effect on pathway out-
come by disturbing the tight regulation, thus contributing to a
malignant phenotype of the cell. Studying both direct and indirect
effects of miRNAs may unveil important core miRNAs that can
further be used as markers of disease. Finally, the use of miRNA
expression together with protein expression might give a robust
proxy of a disease state as they together may constitute a “state-of-
the-network” signature.
4.3 miRNAs as Clinical Biomarkers for Diagnostic and Predictive Purposes in Breast Cancer
The field of miRNA biology is rather new, considering that the first
miRNA was discovered in 1993, and there are still new miRNAs
being identified today. During the past 10 years, miRNA research
has advanced rapidly, and has produced new knowledge about the
molecular basis of cancer, tools for molecular classification, and new
markers with diagnostic and prognostic relevance [62]. miRNAs
are considered suitable biomarkers for early cancer detection
because they are present and stable in human serum and plasma
[108]. miRNA alterations during breast cancer progression from
DCIS to invasive cancer have recently been identified within the
intrinsic subtypes, luminal A, luminal B, HER2-enriched, and
basal-like [109, 110]. For immunohistochemical-based subtypes
no miRNAs are differentially expressed between DCIS and the
luminal subtypes. Six miRNAs were downregulated in ER/
HER2+ invasive samples compared to DCIS, of which five belong
to the miR-30 family, whereas miR-139-5p was downregulated in
both ER subtypes, while miR-887-3p was downregulated in
triple-negative breast cancer only [109]. This study found that
subtype stratification based on molecular signatures resulted in
more correct classification than stratification based on ER, PR,
and HER2 alone, indicating a better representation of the intrinsic
biology of the samples.
Although the focus has previously been on identifying molecu-
lar differences between cancerous and normal tissue, we often tend
to forget that abnormal cell growth also occurs at benign stages. As
discussed earlier in this chapter, previous studies have shown that
certain types of benign tumors can increase the risk of breast cancer
[22, 23, 25–28]. As the use of mammography has increased, the
identification of benign breast disease has become more common.
Thus, having accurate risk estimates for women who receive this
diagnosis is vital. Moreover, with the distinction between benign
tumors and malignant tumors, Tahiri et al. [111] identified that
deregulation of known cancer-related miRNAs is evident also in
fibroadenomas and fibroadenomatosis, considered as benign
lesions in the breast. These cancer-related miRNAs included
miR-21, members of the let-7 family and other miRNAs well
known to be included in malignant transformation [111]. The
level of deregulation in benign tumors was less pronounced than
that observed in malignant tumors. Nevertheless, the identification
of tumor-associated miRNAs in benign tumors hinted that similar
processes are in place already at early stages of tumor formation.
The identification of miRNAs that can be assigned to either benign
or malignant groups of tumor tissue would be important for diag-
nostic purposes, but these results need to be further strengthened
by independent confirmations.
When identifying a signature of miRNAs through expression
arrays, there are different points to take into consideration. First,
there is a reported lack of consistency between different studies, which
certainly gives rise to some concern about the use of miRNAs as biomarkers
in the clinic [112]. Such differences might arise from
sample selection or preparation, experimental design, and/or data
analysis. Also, the technology used is important in the search for
biomarkers. A recent study compared the expression of more than
2000 miRNAs by microarray technology and next-generation
sequencing [113]. The authors observed a highly significant dependency
of the measured expression level on the miRNA nucleotide
composition. Uracil-rich miRNAs showed higher expression levels when
analyzed by next-generation sequencing. In contrast, guanine-rich
miRNAs were detected at higher levels in microarrays. While iden-
tifying subsets of miRNAs that had high correlation with both
technologies, correlation was observed only for miRNAs in the
early miRBase versions (<8). Also, one of the major problems
with both technologies was the elimination of low abundance
miRNAs that may potentially have a great impact on overall processes.
Such technology-specific biases will potentially slow
down the translation to clinical application.
Moreover, the use of different controls for data normalization
can explain some of the observed variability across studies. Another
possibility that must be considered is the dynamic and immediate
regulation in miRNA levels in stress response and in hypoxia. As a
result, time of sample collection and sample processing could fur-
ther impact miRNA levels [62].
Despite those challenges, the evidence reported to date is
encouraging. Even though a more comprehensive validation is
still needed, the usefulness of miRNAs as biomarkers would be especially
strengthened if it were possible to identify deregulated
levels of miRNAs in the circulation of patients who have not yet presented
with the disease, or of patients diagnosed with benign tumors.
4.4 Clinical Implications of miRNAs: Prospects for Therapy
As miRNAs have been reported to act as tumor-suppressor or
oncogenic miRNAs, they have emerged as potential targets for
therapy. miRNA expression signatures have been shown to function
as classifiers for diagnosis, prognosis, and therapeutic response in
cancer [56], and were associated with breast cancer subtypes and
clinical subgroups. Notably, due to the tissue-specific expression of
miRNAs, Rosetta Genomics has commercially launched an assay
that is used to identify the tumor of origin in cancers of unknown
primary origin. This assay measures the expression of 64 miRNAs
which are further processed by an algorithm that can accurately
identify the origin of a patient’s tumor for 42 different cancer types
[114]. The algorithm uses two classifiers, a binary decision tree in
which the decision is made at each node by comparing the expres-
sion of a certain combination of miRNAs to a preset threshold, and
a k-nearest-neighbor algorithm that uses a confidence measure by
comparing the expression of the 64 miRNAs to the training
samples.
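The k-nearest-neighbor component of such a classifier can be sketched in general terms. This is not the commercial algorithm: the three-feature profiles, the tissue labels, the Euclidean distance, and the definition of confidence as the fraction of agreeing neighbors are all assumptions for illustration (the assay described above uses 64 miRNAs).

```python
import math
from collections import Counter


def knn_predict(sample, training, k=3):
    """Classify a profile by majority vote among its k nearest neighbours.

    training is a list of (profile, label) pairs. Returns the winning
    label and, as a simple confidence measure, the fraction of the k
    neighbours that voted for it.
    """
    nearest = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / k


# Made-up training profiles: expression of three miRNAs per tumor.
training = [
    ((8.0, 2.0, 5.0), "breast"),
    ((7.5, 2.2, 5.1), "breast"),
    ((8.2, 1.9, 4.8), "breast"),
    ((3.0, 6.0, 5.0), "lung"),
    ((2.8, 6.3, 5.2), "lung"),
    ((3.1, 5.9, 4.9), "lung"),
]

label, confidence = knn_predict((7.9, 2.1, 5.0), training, k=3)
```

A low confidence value would flag samples whose neighbors disagree; in a combined scheme such as the one described in the text, a second classifier (here, the decision tree) could be consulted as well.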
Besides the fact that miRNAs are implicated in cancer, the

above-discussed ability of miRNAs to regulate several genes in a
pathway makes them interesting therapeutic agents as they may
coordinate the response of an entire signaling network. Thus, by
modifying miRNA activity it could be possible to restore homeo-
stasis in cancer cells by rewiring network connections, and hence
reverse a cancer phenotype [115]. Furthermore, as the absolute
number of deregulated miRNAs is lower than for protein-coding
genes, it might be easier to distinguish the drivers from the passen-
gers and targeting miRNAs therapeutically may prove more suc-
cessful than targeting single genes or proteins [115].
There are two main therapeutic strategies for modulation of
miRNA expression. The first involves the restoration of tumor-
suppressor miRNA activity, and the other is based on inhibiting
the activity of oncogenic miRNAs [116]. An alternative indirect
approach is to use drugs to modulate the miRNA expression by
targeting steps in their biogenesis such as transcription or proces-
sing [115]. Restoration of tumor-suppressor miRNA expression
can be achieved for example by introducing double-stranded
miRNA mimics that are synthetic oligonucleotides with identical
sequence as the selected tumor-suppressor miRNA [115]. There is,
however, a long way from in vitro cell line experiments to clinical
trials. The successful delivery of miRNAs to tumor cells is a major
general challenge as unmodified, synthetic oligonucleotides are
rapidly degraded by nucleases, and owing to their size and negative
charge they may be prevented from crossing the cell membrane
[117]. Pharmacological blocking of oncogenic miRNAs has been
achieved by using chemically modified antisense oligonucleotides
[116]. These antisense strands function as competitive inhibitors of
miRNAs by physically annealing to the mature miRNA and inhibit-
ing its function. By introducing modifications to the chemical
structure of the oligonucleotides such as for example locked nucleic
acids (LNA), the stability, specificity, and binding affinity could be
increased [115]. As an example of this antisense technology,
miR-21 knockdown through LNA silencing was shown to inhibit
proliferation and migration in human breast cancer cell lines and
tumor growth in mice [118]. An alternative to antisense oligonu-
cleotides are miRNA sponges that are transcribed from expression
vectors and which contain multiple tandem-binding sites to a
miRNA of interest [115]. They function as miRNA decoys by
competing with the endogenous bona fide target mRNAs for
miRNA binding, thus decreasing the miRNA effect.
Though their ability to regulate many genes simultaneously
makes miRNAs attractive as therapeutic candidates, this feature
also implies that targeting miRNAs could lead to
off-target effects. The high context dependence of miRNA action
means that a miRNA may have different functions in different
tissues, which is a challenge in this regard. Effective systems that
deliver the synthetic miRNA oligonucleotides specifically to the
diseased tissue or cancer cells could solve these problems
[115]. Another concern is the potential overloading of the
endogenous miRNA processing machinery, as the synthetic
oligonucleotides could saturate the RISC complex and displace
endogenous miRNAs, which could potentially cause toxicity
[115, 117]. Finally, due to sequence similarities between members
of the same miRNA family, antisense miRNA therapy may need
to target all family members in case of functional
redundancy.
The success of miRNA-based therapy will depend on solving
technical issues such as effective and specific delivery of miRNAs.
In addition, better knowledge of miRNA function is needed to
identify potential therapeutic niches and to foresee downstream
effects. miRNA-based therapeutics could then offer opportunities
for a network therapy for cancer, focusing on the miRNAs
rather than the protein-coding oncogenes, which may be more
difficult to target therapeutically [101]. miRNA signatures, e.g.,
from circulating miRNAs in breast cancer patients, are currently
used in human clinical trials, with the majority of the studies
focusing on miRNA signatures as biomarkers for diagnosis,
prognosis, or therapeutic response [56]. Overall, it will be exciting
to follow the developments in novel efforts to target miRNAs
therapeutically and their implications for cancer therapy and
personalized medicine.
Acknowledgments
Parts of this review have been part of two doctoral theses from the
University of Oslo, Norway, under the supervision of V.N.K.: one
of M.R.A., fellow of the Research Council of Norway, and one of A.
T., fellow of the South-Eastern Norway Regional Health Authority.
Both are at present postdoctoral fellows of the South-Eastern Nor-
way Regional Health Authority.
References
1. Crick F (1970) Central dogma of molecular biology. Nature 227(5258):561
2. Lee RC, Feinbaum RL, Ambros V (1993) The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75(5):843–854
3. Reinhart BJ, Slack FJ, Basson M, Pasquinelli AE, Bettinger JC, Rougvie AE, Horvitz HR, Ruvkun G (2000) The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature 403(6772):901–906
4. Kozomara A, Griffiths-Jones S (2011) miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res 39(Database Issue):D152–D157
5. Friedman RC, Farh KK-H, Burge CB, Bartel DP (2009) Most mammalian mRNAs are conserved targets of microRNAs. Genome Res 19(1):92–105
6. Lee Y, Ahn C, Han J, Choi H, Kim J, Yim J, Lee J, Provost P, Radmark O, Kim S et al (2003) The nuclear RNase III Drosha initiates microRNA processing. Nature 425(6956):415–419
7. Ha M, Kim VN (2014) Regulation of microRNA biogenesis. Nat Rev Mol Cell Biol 15(8):509–524
8. Kolb FA, Zhang H, Jaronczyk K, Tahbaz N, Hobman TC, Filipowicz W (2005) Human dicer: purification, properties, and interaction with PAZ PIWI domain proteins. Methods Enzymol 392:316–336
9. Forman JJ, Legesse-Miller A, Coller HA (2008) A search for conserved sequences in coding regions reveals that the let-7 microRNA targets Dicer within its coding sequence. Proc Natl Acad Sci U S A 105(39):14879–14884
10. Lytle JR, Yario TA, Steitz JA (2007) Target mRNAs are repressed as efficiently by microRNA-binding sites in the 5′ UTR as in the 3′ UTR. Proc Natl Acad Sci U S A 104(23):9667–9672
11. Dennis C (2002) The brave new world of RNA. Nature 418(6894):122–124
12. Sullivan RP, Leong JW, Fehniger TA (2013) MicroRNA regulation of natural killer cells. Front Immunol 4:44
13. Boyer LA, Lee TI, Cole MF, Johnstone SE, Levine SS, Zucker JR, Guenther MG, Kumar RM, Murray HL, Jenner RG et al (2005) Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122(6):947–956
14. O’Donnell KA, Wentzel EA, Zeller KI, Dang CV, Mendell JT (2005) c-Myc-regulated microRNAs modulate E2F1 expression. Nature 435(7043):839–843
15. Marson A, Levine SS, Cole MF, Frampton GM, Brambrink T, Johnstone S, Guenther MG, Johnston WK, Wernig M, Newman J et al (2008) Connecting microRNA genes to the core transcriptional regulatory circuitry of embryonic stem cells. Cell 134(3):521–533
16. Mattick JS, Makunin IV (2006) Non-coding RNA. Hum Mol Genet 15:R17–R29
17. Czech B, Hannon GJ (2011) Small RNA sorting: matchmaking for Argonautes. Nat Rev Genet 12(1):19–31
18. Martin G, Schouest K, Kovvuru P, Spillane C (2007) Prediction and validation of microRNA targets in animal genomes. J Biosci 32(6):1049–1052
19. Thomas M, Lieberman J, Lal A (2010) Desperately seeking microRNA targets. Nat Struct Mol Biol 17(10):1169–1174
20. Baek D, Villen J, Shin C, Camargo FD, Gygi SP, Bartel DP (2008) The impact of microRNAs on protein output. Nature 455(7209):64–71
21. Jemal A, Bray F, Center MM, Ferlay J, Ward E, Forman D (2011) Global cancer statistics. CA Cancer J Clin 61(2):69–90
22. Dupont WD, Page DL (1985) Risk factors for breast cancer in women with proliferative breast disease. N Engl J Med 312(3):146–151
23. Dupont WD, Page DL, Parl FF, Vnencak-Jones CL, Plummer WD Jr, Rados MS, Schuyler PA (1994) Long-term risk of breast cancer in women with fibroadenoma. N Engl J Med 331(1):10–15
24. McPherson K, Steel CM, Dixon JM (2000) ABC of breast diseases. Breast cancer-epidemiology, risk factors, and genetics. BMJ 321(7261):624–628
25. Worsham MJ, Raju U, Lu M, Kapke A, Botttrell A, Cheng J, Shah V, Savera A, Wolman SR (2009) Risk factors for breast cancer from benign breast disease in a diverse population. Breast Cancer Res Treat 118(1):1–7
26. Fitzgibbons PL, Henson DE, Hutter RV (1998) Benign breast changes and the risk for subsequent breast cancer: an update of the 1985 consensus statement. Cancer Committee of the College of American Pathologists. Arch Pathol Lab Med 122(12):1053–1055
27. McDivitt RW, Stevens JA, Lee NC, Wingo PA, Rubin GL, Gersell D (1992) Histologic types of benign breast disease and the risk for breast cancer. The Cancer and Steroid Hormone Study Group. Cancer 69(6):1408–1414
28. Cole P, Mark Elwood J, Kaplan SD (1978) Incidence rates and risk factors of benign breast neoplasms. Am J Epidemiol 108(2):112–120
29. Sgroi DC (2010) Preinvasive breast cancer. Annu Rev Pathol 5:193–221
30. Johnson K, Sarma D, Hwang ES (2015) Lobular breast cancer series: imaging. Breast Cancer Res 17:94
31. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA et al (2000) Molecular portraits of human breast tumours. Nature 406(6797):747–752
32. Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS et al (2001) Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A 98(19):10869–10874
33. Gown AM (2008) Current issues in ER and HER2 testing by IHC in breast cancer. Mod Pathol 21(Suppl 2):S8–S15
34. de Azambuja E, Cardoso F, de Castro G, Colozza M, Mano MS, Durbecq V, Sotiriou C, Larsimont D, Piccart-Gebhart MJ, Paesmans M (2007) Ki-67 as prognostic marker in early breast cancer: a meta-analysis of published studies involving 12 155 patients. Br J Cancer 96(10):1504–1513
35. van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871):530–536
36. Enerly E, Steinfeld I, Kleivi K, Aure MR, Leivonen SK, Johnsen H, Kallioniemi O, Kristensen VN, Yakhini Z, Borresen-Dale AL (2010) Molecular characterization of breast cancer subtypes derived from joint analysis of high throughput miRNA and mRNA data. EJC Suppl 8(5):164
37. Sotiriou C, Neo SY, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, Fox SB, Harris AL, Liu ET (2003) Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proc Natl Acad Sci U S A 100(18):10393–10398
38. Inic Z, Zegarac M, Inic M, Markovic I, Kozomara Z, Djurisic I, Inic I, Pupic G, Jancic S (2014) Difference between luminal A and luminal B subtypes according to Ki-67, tumor size, and progesterone receptor negativity providing prognostic information. Clin Med Insights Oncol 8:107–111
39. Subik K, Lee JF, Baxter L, Strzepek T, Costello D, Crowley P, Xing L, Hung MC, Bonfiglio T, Hicks DG et al (2010) The expression patterns of ER, PR, HER2, CK5/6, EGFR, Ki-67 and AR by immunohistochemical analysis in breast cancer cell lines. Breast Cancer (Auckl) 4:35–41
40. Prat A, Adamo B, Cheang MC, Anders CK, Carey LA, Perou CM (2013) Molecular characterization of basal-like and non-basal-like triple-negative breast cancer. Oncologist 18(2):123–133
41. Atkinson AJ, Colburn WA, DeGruttola VG, DeMets DL, Downing GJ, Hoth DF, Oates JA, Peck CC, Schooley RT, Spilker BA et al (2001) Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin Pharmacol Therap 69(3):89–95
42. Sobin LH (2003) TNM: evolution and relation to other prognostic factors. Semin Surg Oncol 21(1):3–7
43. Karve TM, Cheema AK (2011) Small changes huge impact: the role of protein posttranslational modifications in cellular homeostasis and disease. J Amino Acids 2011:207691
44. Sharma S, Kelly TK, Jones PA (2010) Epigenetics in cancer. Carcinogenesis 31(1):27–36
45. Wu W, Zhao S (2013) Metabolic changes in cancer: beyond the Warburg effect. Acta Biochim Biophys Sin Shanghai 45(1):18–26
46. Calin GA, Dumitru CD, Shimizu M, Bichi R, Zupo S, Noch E, Aldler H, Rattan S, Keating M, Rai K (2002) Frequent deletions and down-regulation of micro-RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proc Natl Acad Sci U S A 99(24):15524–15529
47. Calin GA, Sevignani C, Dan Dumitru C, Hyslop T, Noch E, Yendamuri S, Shimizu M, Rattan S, Bullrich F, Negrini M et al (2004) Human microRNA genes are frequently located at fragile sites and genomic regions involved in cancers. Proc Natl Acad Sci U S A 101(9):2999–3004
48. Iorio MV, Ferracin M, Liu C-G, Veronese A, Spizzo R, Sabbioni S, Magri E, Pedriali M, Fabbri M, Campiglio M et al (2005) MicroRNA gene expression deregulation in human breast cancer. Cancer Res 65(16):7065–7070
49. Tavazoie SF, Alarcon C, Oskarsson T, Padua D, Wang Q, Bos PD, Gerald WL, Massague J (2008) Endogenous human microRNAs that suppress breast cancer metastasis. Nature 451(7175):147–152
50. Yan L-X, Huang X-F, Shao Q, Huang MAY, Deng L, Wu Q-L, Zeng Y-X, Shao J-Y (2008) MicroRNA miR-21 overexpression in human breast cancer is associated with advanced clinical stage, lymph node metastasis and patient poor prognosis. RNA 14(11):2348–2360
51. Castellano L, Giamas G, Jacob J, Coombes RC, Lucchesi W, Thiruchelvam P, Barton G, Jiao LR, Wait R, Waxman J et al (2009) The estrogen receptor-α-induced microRNA signature regulates itself and its transcriptional response. Proc Natl Acad Sci U S A 106(37):15732–15737
52. Cittelly D, Das P, Spoelstra N, Edgerton S, Richer J, Thor A, Jones F (2010) Downregulation of miR-342 is associated with tamoxifen resistant breast tumors. Mol Cancer 9(1):317
53. Enerly E, Steinfeld I, Kleivi K, Leivonen S-K, Aure MR, Russnes HG, Rønneberg JA, Johnsen H, Navon R, Rødland E et al (2011) miRNA-mRNA integrated analysis reveals roles for miRNAs in primary breast tumors. PLoS One 6(2):e16915
54. Deng S, Calin GA, Croce CM, Coukos G, Zhang L (2008) Mechanisms of microRNA deregulation in human cancer. Cell Cycle 7(17):2643–2646
55. Bertoli G, Cava C, Castiglioni I (2015) MicroRNAs: new biomarkers for diagnosis, prognosis, therapy prediction and therapeutic tools for breast cancer. Theranostics 5(10):1122–1143
56. Nana-Sinkam SP, Croce CM (2013) Clinical applications for microRNAs in cancer. Clin Pharmacol Ther 93(1):98–104
57. Ouyang M, Li Y, Ye S, Ma J, Lu L, Lv W, Chang G, Li X, Li Q, Wang S et al (2014) MicroRNA profiling implies new markers of chemoresistance of triple-negative breast cancer. PLoS One 9(5):e96228
58. Dong Y, Wu WK, Wu CW, Sung JJ, Yu J, Ng SS (2011) MicroRNA dysregulation in colorectal cancer: a clinical perspective. Br J Cancer 104(6):893–898
59. Tumilson CA, Lea RW, Alder JE, Shaw L (2014) Circulating microRNA biomarkers for glioma and predicting response to therapy. Mol Neurobiol 50(2):545–558
60. Mazan-Mamczarz K, Gartenhaus RB (2013) Role of microRNA deregulation in the pathogenesis of diffuse large B-cell lymphoma (DLBCL). Leuk Res 37(11):1420–1428
61. Maugeri-Sacca M, Coppola V, Bonci D, De Maria R (2012) MicroRNAs and prostate cancer: from preclinical research to translational oncology. Cancer J 18(3):253–261
62. Iorio MV, Croce CM (2012) MicroRNA dysregulation in cancer: diagnostics, monitoring and therapeutics. A comprehensive review. EMBO Mol Med 4(3):143–159
63. Boeri M, Verri C, Conte D, Roz L, Modena P, Facchinetti F, Calabro E, Croce CM, Pastorino U, Sozzi G (2011) MicroRNA signatures in tissues and plasma predict development and prognosis of computed tomography detected lung cancer. Proc Natl Acad Sci U S A 108(9):3713–3718
64. Cava C, Bertoli G, Ripamonti M, Mauri G, Zoppis I, Della Rosa PA, Gilardi MC, Castiglioni I (2014) Integration of mRNA expression profile, copy number alterations, and microRNA expression levels in breast cancer to improve grade definition. PLoS One 9(5):e97681
65. Weiler J, Hunziker J, Hall J (2006) Anti-miRNA oligonucleotides (AMOs): ammunition to target miRNAs implicated in human disease? Gene Ther 13(6):496–502
66. Leivonen SK, Sahlberg KK, Makela R, Due EU, Kallioniemi O, Borresen-Dale AL, Perala M (2014) High-throughput screens identify microRNAs essential for HER2 positive breast cancer cell growth. Mol Oncol 8(1):93–104
67. Kepp O, Galluzzi L, Lipinski M, Yuan J, Kroemer G (2011) Cell death assays for drug discovery. Nat Rev Drug Discov 10(3):221–237
68. Riss TL, Moravec RA, Niles AL, Duellman S, Benink HA, Worzella TJ, Minor L (2004) Cell viability assays. In: Sittampalam GS, Coussens NP, Brimacombe K, Grossman A, Arkin M, Auld D, Austin C, Bejcek B, Glicksman M, Inglese J et al (eds) Assay guidance manual. Eli Lilly & Company, Bethesda, MD
69. Zhu Q, Wong AK, Krishnan A, Aure MR, Tadych A, Zhang R, Corney DC, Greene CS, Bongo LA, Kristensen VN et al (2015) Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nat Methods 12(3):211–214
70. Enright A, John B, Gaul U, Tuschl T, Sander C, Marks D (2003) MicroRNA targets in Drosophila. Genome Biol 5(1):R1
71. John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS (2004) Human microRNA targets. PLoS Biol 2(11):e363
72. Betel D, Wilson M, Gabow A, Marks DS, Sander C (2008) The microRNA.org resource: targets and expression. Nucleic Acids Res 36(Database Issue):D149–D153
73. Lewis BP, Burge CB, Bartel DP (2005) Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120(1):15–20
74. Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB (2003) Prediction of mammalian microRNA targets. Cell 115(7):787–798
75. Krek A, Grun D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, MacMenamin P, da Piedade I, Gunsalus KC, Stoffel M et al (2005) Combinatorial microRNA target predictions. Nat Genet 37(5):495–500
76. Grun D, Wang YL, Langenberger D, Gunsalus KC, Rajewsky N (2005) microRNA target predictions across seven Drosophila species and comparison to mammalian targets. PLoS Comput Biol 1(1):e13
77. Lall S, Grun D, Krek A, Chen K, Wang YL, Dewey CN, Sood P, Colombo T, Bray N, Macmenamin P et al (2006) A genome-wide map of conserved microRNA targets in C. elegans. Curr Biol 16(5):460–471
78. Maragkakis M, Alexiou P, Papadopoulos GL, Reczko M, Dalamagas T, Giannopoulos G, Goumas G, Koukis E, Kourtis K, Simossis VA et al (2009) Accurate microRNA target prediction correlates with protein repression levels. BMC Bioinformatics 10:295
79. Paraskevopoulou MD, Georgakilas G, Kostoulas N, Vlachos IS, Vergoulis T, Reczko M, Filippidis C, Dalamagas T, Hatzigeorgiou AG (2013) DIANA-microT web server v5.0: service integration into miRNA functional analysis workflows. Nucleic Acids Res 41(W1):W169–W173
80. Ekimler S, Sahin K (2014) Computational methods for microRNA target prediction. Genes (Basel) 5(3):671–683
81. Hanahan D, Weinberg RA (2011) Hallmarks of cancer: the next generation. Cell 144(5):646–674
82. Hyman E, Kauraniemi P, Hautaniemi S, Wolf M, Mousses S, Rozenblum E, Ringnér M, Sauter G, Monni O, Elkahloun A et al (2002) Impact of DNA amplification on gene expression patterns in breast cancer. Cancer Res 62(21):6240–6245
83. Bergamaschi A, Kim YH, Wang P, Sørlie T, Hernandez-Boussard T, Lonning PE, Tibshirani R, Børresen-Dale A-L, Pollack JR (2006) Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer. Genes Chromosom Cancer 45(11):1033–1040
84. Aure MR, Leivonen SK, Fleischer T, Zhu Q, Overgaard J, Alsner J, Tramm T, Louhimo R, Alnæs GI, Perälä M, Busato F, Touleimat N, Tost J, Børresen-Dale AL, Hautaniemi S, Troyanskaya OG, Lingjærde OC, Sahlberg KK, Kristensen VN (2013) Individual and combined effects of DNA methylation and copy number alterations on miRNA expression in breast tumors. Genome Biol 14(11):R126
85. Lahti L, Schäfer M, Klein H-U, Bicciato S, Dugas M (2012) Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review. Brief Bioinform 14(1):27–35
86. Louhimo R, Lepikhova T, Monni O, Hautaniemi S (2012) Comparative analysis of algorithms for integration of copy number and expression data. Nat Methods 9(4):351–355
87. Huang N, Shah PK, Li C (2011) Lessons from a decade of integrating cancer copy number alterations with gene expression profiles. Brief Bioinform 13(3):305–316
88. Zhang S, Liu C-C, Li W, Shen H, Laird PW, Zhou XJ (2012) Discovery of multi-dimensional modules by integrative analysis of cancer genomic data. Nucleic Acids Res 40(19):9379–9391
89. McDermott JE, Costa M, Janszen D, Singhal M, Tilton SC (2010) Separating the drivers from the driven: integrative network and pathway approaches aid identification of disease biomarkers from high-throughput data. Dis Markers 28(4):253–266
90. Baudot A, Real FX, Izarzugaza JMG, Valencia A (2009) From cancer genomes to cancer models: bridging the gaps. EMBO Rep 10(4):359–366
91. Chin L, Hahn WC, Getz G, Meyerson M (2011) Making sense of cancer genomic data. Genes Dev 25(6):534–555
92. Dvinge H, Git A, Graf S, Salmon-Divon M, Curtis C, Sottoriva A, Zhao Y, Hirst M, Armisen J, Miska EA et al (2013) The shaping and functional consequences of the microRNA landscape in breast cancer. Nature 497(7449):378–382
93. Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y et al (2012) The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486(7403):346–352
94. Hernández Patiño CE, Jaime-Muñoz G, Resendis-Antonio O (2013) Systems biology of cancer: moving toward the integrative study of the metabolic alterations in cancer cells. Front Physiol 3:481
95. The Cancer Genome Atlas Network (2012) Comprehensive molecular portraits of human breast tumours. Nature 490(7418):61–70
96. Muniategui A, Pey J, Planes FJ, Rubio A (2012) Joint analysis of miRNA and mRNA expression data. Brief Bioinform 14(3):263–278
97. Chi SW, Zang JB, Mele A, Darnell RB (2009) Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature 460(7254):479–486
98. Aure MR, Jernstrom S, Krohn M, Vollan H, Due E, Rodland E, Karesen R, Ram P, Lu Y, Mills G et al (2015) Integrated analysis reveals microRNA networks coordinately expressed with key proteins in breast cancer. Genome Med 7(1):21
99. Hertel J, Lindemeyer M, Missal K, Fried C, Tanzer A, Flamm C, Hofacker I, Stadler P, Students of Bioinformatics Computer Labs 2004 and 2005 (2006) The expansion of the metazoan microRNA repertoire. BMC Genomics 7:25
100. Griffiths-Jones S, Saini HK, van Dongen S, Enright AJ (2008) miRBase: tools for microRNA genomics. Nucleic Acids Res 36(Database issue):D154–D158
101. Inui M, Martello G, Piccolo S (2010) MicroRNA control of signal transduction. Nat Rev Mol Cell Biol 11(4):252–263
102. Bentwich I (2005) Prediction and validation of microRNAs and their targets. FEBS Lett 579(26):5904–5910
103. Patnaik SK, Dahlgaard J, Mazin W, Kannisto E, Jensen T, Knudsen S, Yendamuri S (2012) Expression of microRNAs in the NCI-60 cancer cell-lines. PLoS One 7(11):e49918
104. Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D, Sweet-Cordero A, Ebert BL, Mak RH, Ferrando AA et al (2005) MicroRNA expression profiles classify human cancers. Nature 435(7043):834–838
105. Selbach M, Schwanhausser B, Thierfelder N, Fang Z, Khanin R, Rajewsky N (2008) Widespread changes in protein synthesis induced by microRNAs. Nature 455(7209):58–63
106. Creixell P, Schoof EM, Erler JT, Linding R (2012) Navigating cancer network attractors for tumor-specific therapy. Nat Biotechnol 30(9):842–848
107. Avraham R, Yarden Y (2012) Regulation of signalling by microRNAs. Biochem Soc Trans 40(1):26–30
108. Mitchell PS, Parkin RK, Kroh EM, Fritz BR, Wyman SK, Pogosova-Agadjanyan EL, Peterson A, Noteboom J, O’Briant KC, Allen A et al (2008) Circulating microRNAs as stable blood-based markers for cancer detection. Proc Natl Acad Sci U S A 105(30):10513–10518
109. Haakensen VD, Nygaard V, Greger L, Aure MR, Fromm B, Bukholm IR, Luders T, Chin SF, Git A, Caldas C et al (2016) Subtype-specific micro-RNA expression signatures in breast cancer progression. Int J Cancer 139(5):1117–1128
110. Lesurf R, Aure MR, Mork HH, Vitelli V, Oslo Breast Cancer Research Consortium, Lundgren S, Borresen-Dale AL, Kristensen V, Warnberg F, Hallett M et al (2016) Molecular features of subtype-specific progression from ductal carcinoma in situ to invasive breast cancer. Cell Rep 16(4):1166–1179
111. Tahiri A, Leivonen SK, Luders T, Steinfeld I, Aure MR, Geisler J, Makela R, Nord S, Riis MLH, Yakhini Z et al (2014) Deregulation of cancer-related miRNAs is a common event in both benign and malignant human breast tumors. Carcinogenesis 35(1):76–85
112. Callari M, Dugo M, Musella V, Marchesi E, Chiorino G, Grand MM, Pierotti MA, Daidone MG, Canevari S, De Cecco L (2012) Comparison of microarray platforms for measuring differential microRNA expression in paired normal/cancer colon tissues. PLoS One 7(9):e45105
113. Backes C, Sedaghat-Hamedani F, Frese K, Hart M, Ludwig N, Meder B, Meese E, Keller A (2016) Bias in high-throughput analysis of miRNAs and implications for biomarker studies. Anal Chem 88(4):2088–2095
114. Meiri E, Mueller WC, Rosenwald S, Zepeniuk M, Klinke E, Edmonston TB, Werner M, Lass U, Barshack I, Feinmesser M et al (2012) A second-generation microRNA-based assay for diagnosing tumor tissue origin. Oncologist 17(6):801–812
115. Garzon R, Marcucci G, Croce CM (2010) Targeting microRNAs in cancer: rationale, strategies and challenges. Nat Rev Drug Discov 9(10):775–789
116. Thorsen SB, Obad S, Jensen NF, Stenvang J, Kauppinen S (2012) The therapeutic potential of microRNAs in cancer. Cancer J 18(3):275–284
117. Aagaard L, Rossi JJ (2007) RNAi therapeutics: principles, prospects and challenges. Adv Drug Deliv Rev 59(2–3):75–86
118. Yan LX, Wu QN, Zhang Y, Li YY, Liao DZ, Hou JH, Fu J, Zeng MS, Yun JP, Wu QL et al (2011) Knockdown of miR-21 in human breast cancer cell lines inhibits proliferation, in vitro migration and in vivo tumor growth. Breast Cancer Res 13(1):R2
119. Lujambio A, Lowe SW (2012) The microcosmos of cancer. Nature 482(7385):347–355
Chapter 5
Identifying Genetic Dependencies in Cancer by Analyzing
siRNA Screens in Tumor Cell Line Panels
James Campbell, Colm J. Ryan, and Christopher J. Lord
Abstract
Loss-of-function screening using RNA interference or CRISPR approaches can be used to identify genes
that specific tumor cell lines depend upon for survival. By integrating the results from screens in multiple
cell lines with molecular profiling data, it is possible to associate the dependence upon specific genes with
particular molecular features (e.g., the mutation of a cancer driver gene, or transcriptional or proteomic
signature). Here, using a panel of kinome-wide siRNA screens in osteosarcoma cell lines as an example, we
describe a computational protocol for analyzing loss-of-function screens to identify genetic dependencies
associated with particular molecular features. We describe the steps required to process the siRNA screen
data, to integrate the results with genotypic information to identify genetic dependencies, and finally to
integrate protein-protein interaction data to interpret these dependencies.
Key words Cancer, siRNA screening, Synthetic lethality
1 Introduction
Recent large-scale sequencing projects and decades of small-scale
studies have led to the identification of hundreds of “driver” genes
in cancer—genes whose alteration through genetic or epigenetic
means provides a growth or survival advantage for tumor cells
[1, 2]. A key remaining challenge is to understand how these driver
mutations alter cellular states to promote tumor progression and
how this altered state may be exploited for the development of
targeted therapeutics [3]. Identifying the set of genes that are
required for growth in a given tumor cell line provides both an
insight into the cellular state and suggests genes whose products
may be targeted therapeutically. Toward this end, a number of
laboratories have used loss-of-function screening to generate
resources describing the genetic requirements of panels of tumor
cell lines [4–11]. The majority of these resources use either
James Campbell and Colm J. Ryan contributed equally to this work.
https://doi.org/10.1007/978-1-4939-7493-1_5, © The Author(s) 2018
genome-scale shRNA screens carried out in a pooled format [6, 7,
10] or siRNA screens carried out in an arrayed format [4, 5, 11] to
identify genetic dependencies. In the near future CRISPR-based
approaches will likely be used for similar purposes, although to date
the number of cell lines profiled by genome-wide CRISPR libraries
remains small (e.g., five cell lines in [8]). Regardless of the experi-
mental methodology used, the goal of loss-of-function screens is
largely the same—the identification of genes required for growth in
specific cancer cell lines. By integrating the results of these screens
with genotypic data, it is possible to identify genes that appear
specifically required for growth in the presence of a particular driver
gene mutation. In some cases the driver gene mutation results in an
increased dependency upon the gene itself, a phenomenon known
as “oncogene addiction” [12]. Examples of this include an
increased sensitivity of ERBB2-amplified breast cancer cell lines to
siRNA reagents targeting ERBB2 [4], and an increased sensitivity
of KRAS mutant cell lines to shRNA reagents targeting KRAS
[7]. More frequent are instances where the driver gene and the
resulting dependency gene are different, often termed
non-oncogene addictions or synthetic lethalities [12, 13]. Examples
of non-oncogene addictions identified from loss-of-function
screens include a dependence of ARID1A mutant cell lines upon
the ARID1A paralog ARID1B [14], an increased sensitivity of
PTEN mutant breast cancer cell lines to inhibition of the mitotic
kinase TTK [4], and an increased sensitivity of MYC amplified
breast cancer cell lines to inhibition of multiple spliceosome com-
ponent coding genes [15]. Ultimately both oncogene addictions
and synthetic lethalities identified in these screens may be exploited
for the development of novel targeted therapeutics in cancer [13].
When these screens are analyzed, statistical approaches are used
to identify significant associations between the mutation of a driver
gene and an increased sensitivity to the inhibition of another gene.
The interpretation of the resulting associations remains challeng-
ing—the statistical tests provide information on which genes are
required in the presence of specific driver genes, but not the mech-
anistic explanation as to why these dependencies exist. Inspired by
approaches initially developed for the interpretation of genetic
interactions in yeast [16], we have recently used the integration of
functional interaction networks to aid the interpretation of depen-
dencies identified in loss-of-function screens in cancer cell lines
[5]. For instance, in ERBB2-amplified cell lines we see an increased
dependency upon ERBB2 itself and also upon the ERBB2
protein-interaction partners ERBB3 and PIK3CA [5]. This suggests that
ERBB2-amplified cell lines are frequently “addicted” to the
functionality of ERBB2, the binding of ERBB2 to its interaction partner
ERBB3, and the function of the downstream effector PIK3CA.
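The network-based interpretation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the chapter's repository: the function name, the relation labels, the p-values, and the placeholder target GENE_X are all hypothetical; only the ERBB2, ERBB3, and PIK3CA relationships are taken from the text.

```python
def annotate_associations(associations, ppi_edges):
    """Label each (marker, target, p_value) tuple by its network relationship."""
    # Store edges as unordered pairs so that lookup ignores direction.
    edges = {frozenset(edge) for edge in ppi_edges}
    annotated = []
    for marker, target, p_value in associations:
        if marker == target:
            relation = "self"         # dependency on the driver gene itself
        elif frozenset((marker, target)) in edges:
            relation = "ppi_partner"  # dependency on an interaction partner
        else:
            relation = "none"         # no known interaction in this network
        annotated.append((marker, target, p_value, relation))
    return annotated


# Interaction partners of ERBB2 named in the text.
ppi_edges = [("ERBB2", "ERBB3"), ("ERBB2", "PIK3CA")]

# Hypothetical association results; p-values are made up for illustration.
associations = [
    ("ERBB2", "ERBB2", 0.001),
    ("ERBB2", "ERBB3", 0.004),
    ("ERBB2", "PIK3CA", 0.010),
    ("ERBB2", "GENE_X", 0.030),  # GENE_X is a placeholder, not a real gene
]

for row in annotate_associations(associations, ppi_edges):
    print(row)
```

Dependencies labeled "self" correspond to oncogene addiction, whereas those labeled "ppi_partner" are candidates for a mechanistic explanation through the driver's interaction partners.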
Here, we describe a protocol for the analysis of loss-of-function
screens in a panel of cancer cell lines. We use as example data a
recent kinome-wide siRNA screen performed in a panel of
osteosarcoma cell lines [5]. Our analysis protocol involves three main
steps:
1. The conversion of siRNA screening results into gene-sensitivity
scores.
2. The integration of these sensitivity scores with genotypic data
to identify statistical associations between driver genes and
sensitivity to the inhibition of particular genes.
3. The integration of additional data such as protein-protein
interactions to interpret these associations.
Only the first step is specific to arrayed siRNA screens—we have
successfully applied the latter analysis scripts to data resulting from
additional screen types (e.g., pooled shRNA screens) (Fig. 1).
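To make the first two steps concrete, the following is a minimal sketch written in Python for brevity (the chapter's own scripts for these steps use R and cellHTS2). It rests on several assumptions that are not taken from the protocol: raw values are converted to robust Z-scores using the median and the median absolute deviation, more negative Z-scores are taken to mean greater sensitivity, and the association is assessed with a one-sided exact permutation test on the difference in medians. All function names and data values are hypothetical.

```python
from itertools import combinations
from statistics import median


def robust_z_scores(values):
    """Step 1 (assumed form): centre on the median and scale by the MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; values cannot be scaled")
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    return [(v - med) / (1.4826 * mad) for v in values]


def sensitivity_p_value(z_by_line, mutant_lines):
    """Step 2 (assumed test): are mutant lines more sensitive (lower Z)?"""
    lines = sorted(z_by_line)
    mutant = set(mutant_lines)
    if not 0 < len(mutant) < len(lines):
        raise ValueError("need at least one mutant and one non-mutant line")

    def median_difference(group):
        inside = [z_by_line[name] for name in lines if name in group]
        outside = [z_by_line[name] for name in lines if name not in group]
        return median(inside) - median(outside)

    observed = median_difference(mutant)
    # Enumerate every way of assigning the mutant label to the same number of lines.
    permuted = [median_difference(set(c)) for c in combinations(lines, len(mutant))]
    return sum(d <= observed for d in permuted) / len(permuted)


# Hypothetical Z-scores for one siRNA target across six cell lines.
toy_z = {"line1": -3.0, "line2": -2.5, "line3": 0.1,
         "line4": 0.3, "line5": -0.2, "line6": 0.5}
print(sensitivity_p_value(toy_z, ["line1", "line2"]))
```

In a real analysis the test would be repeated for every driver gene and target pair, so some correction for multiple testing would normally be needed; that step is omitted here.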
2 Materials
2.1 Software (See Notes 1 and 2)
1. R (available from https://www.r-project.org/).
2. R-packages:
(a) gplots (see Note 3).
(b) cellHTS2 (see Note 4).
3. Python programming language (available from
https://www.python.org/).
4. git repository containing the statistical analyses, code, and data
resources discussed in the text (see Note 5):
https://github.com/GeneFunctionTeam/identifying-genetic-dependencies
2.2 Input Files
1. Plate files (txt) contain the output from a loss-of-function
screen. Each comprises three tab-separated columns of
data containing the plate number (numeric), well position
(e.g., B07), and the response value for the well (e.g., luminosity
readout). See the cellHTS2 documentation for further
information.
2. Plate file list. This file contains three tab-separated columns
with a header row listing “Filename,” “Plate,” and “Replicate.”
Filenames correspond to each plate file. The plate column
defines which plate in the plate configuration file the data
correspond to. The replicate column defines which replicate
a plate represents. See the cellHTS2 documentation for
further information.
3. Plate configuration file. The first line defines the number of
wells in each plate (e.g., “Wells: 384”). The second line defines
the number of plates in the library (e.g., “Plates: 3”). The third
line is a header associated with the subsequent columns (e.g.,
“Plate,” “Well,” “content”). The remaining lines define the
wells containing samples and controls. An asterisk character
(*) can be used to mean “all plates or wells.” E.g., “* * sample”
indicates that all wells on all plates are “sample” unless
otherwise stated. Subsequent, more specific lines update the
contents of other wells. E.g., “* [A-P]01 empty” indicates
that on all plates (*), every row ([A-P]) of the first column
(01) is marked as “empty.” When defining wells as containing
controls, ensure the case of the text used matches that used
elsewhere. For further details on the plate configuration file,
see the cellHTS2 documentation.
[Figure 1: three-panel workflow schematic. (a) Plate-arrayed siRNA
screens are processed using cellHTS2/R into a Z-score table of
kinases by cell lines (Step 3.1). (b) The Z-score table is combined
with a mutations table (e.g., RB1 and CDKN2A status per cell line)
and association analysis is performed using R (Step 3.2), giving an
associations table with Marker, Target, and P-value columns (e.g.,
RB1, DYRK1A, 0.005; CDKN2A, BRAF, 0.600). (c) The associations
table is combined with a protein-protein interaction network and
dependencies are annotated using Python (Step 3.3), giving an
annotated associations table with an added PPI column (e.g.,
RB1/DYRK1A True; CDKN2A/BRAF False).]
Fig. 1 Analyzing siRNA screens in tumor cell line panels. (a) Luminescence values derived from arrayed siRNA
screens are converted into Z-scores using cellHTS2 and custom R scripts. (b) Z-score profiles for each cell line
are integrated with mutational profiles for the same set of cell lines using R. Custom R scripts are used to
identify associations between the presence of particular mutations (e.g., in the RB1 gene) with increased
4. Plate annotation file. Contains at least three columns with a
header. The first two columns list the plate and well IDs used in
the library. The third and subsequent columns list annotations
(such as the ID of the gene targeted by an siRNA). For further
details on the plate configuration file, see the cellHTS2
documentation.
5. File containing functional relationships between genes (see
Note 6).
3 Methods
3.1 Processing Typically, siRNA screens are conducted in multiwell tissue culture
siRNA Screen Data plates. The process of transfecting a cancer cell line with siRNAs is
Using CellHTS2 optimized prior to screening and once optimal conditions have
been selected (described in [17]), cells are dispensed into multiwell
plates containing growth media, transfection reagents, and siRNAs.
The data in the example provided represent a screen of a single
osteosarcoma tumor cell line using an siRNA library targeting
714 kinase and kinase-related genes. Positive and negative controls
are included on each plate—typically non-targeting siRNA as a
negative control and an siRNA pool targeting PLK1 as a positive
control. The full experimental protocol for this screen has been
described elsewhere [4, 5]. Briefly, following siRNA transfection,
the cells were cultured for 5 days, after which a luminescence assay
measuring cellular ATP was used to estimate cell viability. A Victor
X5 platereader was used to read luminescence values, resulting in
data files in Microsoft Excel format. Prior to the analysis in R, these
data files were converted to plain text plate files. Each plate file
contains the luminescence reading from each well in one 96 or
Fig. 1 (continued) sensitivity to siRNAs targeting specific genes (e.g., DYRK1A). (c) The associations table is
integrated with a data file describing known protein-protein interactions using Python. This results in a table of
annotated dependencies—indicating whether a given association occurs between a pair of genes whose
protein products are known to physically interact
384 multiwell plate. Where an siRNA library is larger than the plate
format used in the screen, several plates are required for a single
screen. Additionally, multiple replicate screens are typically con-
ducted for a given cell line and siRNA library. The organization of
plates into segments of an siRNA library and replicate screens is
described in a plate list file. A plate list file contains the file names of
the plate files, the replicate numbers, and plate numbers in a multi-
plate screen. Annotations indicating the genes targeted by siRNAs
in the library across multiple plates as well as the positions of
control wells are provided in separate plain text files. The analysis
protocol set out below uses the cellHTS2 [18] R package devel-
oped by Huber and Boutros to combine data from the plate files,
the plate list file, the plate configuration file, and the annotation file.
The luminescence data are normalized to produce Z-scores by first
log2 transforming the values and subtracting the median log lumi-
nescence value on a plate-by-plate basis. The plate-centered data are
then scaled to the median absolute deviation (MAD) value calcu-
lated across the entire siRNA library to produce Z-scores.
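As a minimal illustration of the plate file layout described in Subheading 2.2 (plate number, well position, and response value, separated by tabs), the following Python sketch turns a grid of readings into plate-file rows. The function names and the toy 2 x 3 grid are hypothetical and are not part of the repository; the plate reader export format is not modelled, and the exact layout expected by cellHTS2 should be checked against its documentation.

```python
import string

def well_id(row_index, col_index):
    """Return a well position such as 'B07' from zero-based row/column indices."""
    return f"{string.ascii_uppercase[row_index]}{col_index + 1:02d}"

def plate_rows(plate_number, grid):
    """Yield one tab-separated line (plate, well, value) per well, row by row."""
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            yield f"{plate_number}\t{well_id(r, c)}\t{value}"

# Toy readings for a 2 x 3 corner of a plate (invented values).
toy_grid = [[1200, 1350, 980],
            [1100, 40, 1275]]
lines = list(plate_rows(1, toy_grid))
```

Writing `"\n".join(lines)` to a text file would give one plate file for this toy plate.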
An R script named “run_cellHTS.R” in the R-scripts directory
contains the following commands. The first command loads the
cellHTS2 R package that provides the functions required for the
analysis.
require(cellHTS2)
With cellHTS2 loaded, we then use the readPlateList() function
to read the plate list file, which in turn creates a cellHTS object
containing the luminescence data from the plate files (see Note 7).
x <- readPlateList(
  filename = "platelist_p3r3.txt",
  name = "CGDsExample",
  path = "./"
)
We next use the configure() function to add information from
the plate configuration file and (optionally) the screen log and
description file to the cellHTS object. The plate configuration
defines the locations of samples, controls and empty wells.
x <- configure(
  x,
  descripFile = "screen_description.txt",
  confFile = "plateconf_384.txt",
  logFile = "Screenlog.txt",
  path = "./"
)
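The way successive lines of the plate configuration file refine earlier ones (Subheading 2.2) can be sketched in Python. This only illustrates the override behaviour described in the text, assuming that well patterns behave like regular expressions and that "*" matches everything; it is not the parser used by cellHTS2. The rule list, the "neg" and "pos" labels for control wells, and the function name are hypothetical.

```python
import re

def resolve_contents(rules, plates, wells):
    """Apply (plate, well_pattern, content) rules in order; later rules override."""
    contents = {}
    for plate_rule, well_rule, content in rules:
        well_regex = re.compile(".*" if well_rule == "*" else well_rule)
        for plate in plates:
            if plate_rule != "*" and str(plate) != plate_rule:
                continue
            for well in wells:
                if well_regex.fullmatch(well):
                    contents[(plate, well)] = content
    return contents

# A toy layout of 2 rows x 3 columns (real plates have 96 or 384 wells).
wells = [f"{row}{col:02d}" for row in "AB" for col in (1, 2, 3)]
rules = [
    ("*", "*", "sample"),        # default: every well on every plate is a sample
    ("*", "[A-B]01", "empty"),   # first column is empty on all plates
    ("*", "A02", "neg"),         # hypothetical negative-control well
    ("2", "B03", "pos"),         # hypothetical positive-control well, plate 2 only
]
layout = resolve_contents(rules, plates=[1, 2], wells=wells)
```

Because the rules are applied in order, the broad "sample" default can be stated once and then narrowed, which mirrors the "unless otherwise stated" behaviour described for the configuration file.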
We use the annotate() function to define the genes targeted by
siRNAs in each well of the plate. This information is located in the
“kinome_library.txt” file.
x <- annotate(
  x,
  geneIDFile = "kinome_library.txt",
  path = "./"
)
We now process the luminescence data in the cellHTS object to
normalize data values across the plates in the screen. This is done by
log2 transforming the luminescence values and subtracting the
median value within a plate from all the values of wells in that
plate. The parameters passed to the normalizePlates() function are
described in Note 8. The original cellHTS object “x” is passed to
the normalizePlates() function and the result is saved into a new
cellHTS object called “xn.”
xn <- normalizePlates(
  x,
  scale = "multiplicative",
  log = TRUE,
  method = "median",
  varianceAdjust = "none",
  negControls = "neg",
  posControls = "pos"
)
The normalized data stored in “xn” are then scaled by dividing
each well’s value by the median absolute deviation (MAD)
calculated from the normalized values across the whole siRNA library.
Control wells are excluded from the estimation of the MAD.
Scaling the plate median centered normalized data by the MAD
produces the robust equivalent of Studentized values or Z-scores (see
Note 9).
xsc <- scoreReplicates(
  xn,
  method = "zscore",
  sign = "+"
)
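The two steps above (plate-wise median centering of log2 values, followed by scaling with a library-wide MAD) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the cellHTS2 implementation: the input is a hypothetical dictionary mapping plate numbers to sample-well readings, control wells are assumed to have been removed already, and the MAD is multiplied by 1.4826 (the default constant of R's mad() function) so that it estimates a standard deviation for normally distributed data.

```python
import math
import statistics

MAD_CONSTANT = 1.4826  # makes the MAD comparable to an SD for normal data

def plate_center(readings):
    """log2-transform readings and subtract the plate median."""
    logged = [math.log2(v) for v in readings]
    plate_median = statistics.median(logged)
    return [v - plate_median for v in logged]

def robust_z(plates):
    """Return {plate: [z, ...]} using a median and MAD pooled across all plates."""
    centered = {p: plate_center(vals) for p, vals in plates.items()}
    pooled = [v for vals in centered.values() for v in vals]
    pooled_median = statistics.median(pooled)
    mad = MAD_CONSTANT * statistics.median(abs(v - pooled_median) for v in pooled)
    return {p: [(v - pooled_median) / mad for v in vals]
            for p, vals in centered.items()}

# Invented luminescence readings for two plates of four sample wells each.
toy = {1: [1000, 2000, 4000, 500], 2: [800, 1600, 3200, 100]}
z = robust_z(toy)
```

In this toy input the last well of plate 2 has a much lower reading than the rest of its plate, and it is the only well whose robust Z-score falls below -2.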
For later statistical analyses, it may be preferable to summarize
the values of replicate wells targeting a specific gene as a median or
some other summary statistic. This can be performed using the
summarizeReplicates() function in cellHTS2.
xsc <- summarizeReplicates(
  xsc,
  summary = "median"
)
CellHTS2 also provides a function called getTopTable() that
writes a plain text file containing the well annotation data as well as
the luminescence data at each stage of processing. Here, we write
this information to a file called “TopTable.txt.”
summary_info <- getTopTable(
  list(
    "raw" = x,
    "normalized" = xn,
    "scored" = xsc
  ),
  file = "TopTable.txt"
)
An HTML formatted report can also be generated describing
the screen and the processing steps applied to it using the commands
below. This HTML report provides information on the positive and
negative controls included, the distribution of the resulting scores,
and details of the quality of the screen (Z′ scores, see below).
The contents of the HTML report can be modified using the
setSettings() function. Here, we turn on the reproducibility and
intensities reports (producing heatmap visualizations of well values) and set
the range of heatmap colors for the screen summary scores report.
setSettings(
  list(
    plateList = list(
      reproducibility = list(
        include = TRUE,
        map = TRUE
      ),
      intensities = list(
        include = TRUE,
        map = TRUE
      )
    ),
    screenSummary = list(
      scores = list(
        range = c(-20, 10),
        map = TRUE
      )
    )
  )
)
We then use the writeReport() function to generate the HTML
report.
writeReport(
  raw = x,
  normalized = xn,
  scored = xsc,
  outdir = "./report",
  force = TRUE,
  posControls = "pos",
  negControls = "neg",
  mainScriptFile = "../R-scripts/run_cellHTS.R"
)
The outputs from this cellHTS2 analysis so far have been a
TopTable plain text file and a folder containing an HTML report. It
is possible to extract any data in the cellHTS objects using accessor
methods in order to produce customized outputs. Here, we extract
information on the targeted genes, the plate numbers, well
numbers, and median Z-scores and combine this into a data frame
(“combinedz”) containing four columns (compound, plate, well,
and zscore).
genes <- geneAnno(xsc)
plates <- plate(xsc)
wells <- well(xsc)
scores <- Data(xsc)[, 1, 1]
combinedz <- data.frame(
  compound = genes,
  plate = plates,
  well = wells,
  zscore = scores
)
We can then write out the “combinedz” data frame to a text file.
A use case for this is to enable joining data from multiple screens
into a single file for analysis.
write.table(
  combinedz,
  "zscore.txt",
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)
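Joining per-screen Z-score tables into a single file could look like the following Python sketch. It assumes the combined table has one row per cell line and one column per targeted gene, as described for the siRNA input file in Subheading 3.2. The function names, the "cell_line" header label, the "NA" placeholder, and the cell line names and scores are all hypothetical and invented for illustration.

```python
def build_matrix(screens):
    """screens: {cell_line: [(gene, zscore), ...]} -> (genes, rows)."""
    genes = sorted({gene for records in screens.values() for gene, _ in records})
    rows = []
    for cell_line in sorted(screens):
        lookup = dict(screens[cell_line])
        # Genes that were not scored in a screen are written as "NA".
        rows.append([cell_line] + [lookup.get(gene, "NA") for gene in genes])
    return genes, rows

def to_tsv(genes, rows):
    """Render the matrix as tab-separated text with a header row."""
    header = "\t".join(["cell_line"] + genes)
    body = ["\t".join(str(value) for value in row) for row in rows]
    return "\n".join([header] + body)

# Invented Z-scores for two hypothetical cell lines.
screens = {
    "line_B": [("PLK1", -4.2), ("DYRK1A", -0.3)],
    "line_A": [("DYRK1A", -2.5)],
}
genes, rows = build_matrix(screens)
table = to_tsv(genes, rows)
```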
This analysis needs to be performed for each screen in the
experiment. Typically, multiple distinct screens would represent
multiple tumor cell lines screened with a specific library of siRNAs.
Quality control steps need to be applied on a screen-by-screen
basis. We expect siRNA screen replicates to be strongly correlated
and reject screens where no pairs of replicates have a correlation
coefficient greater than 0.7 (see Note 10).
In an earlier step, we saved the output from the getTopTable()
function to a data frame called “summary_info.” We can extract the
replicate normalized luminescence values from this data frame and
calculate the Pearson correlation coefficients for each pair of
replicates using the following command.
cor(
  summary_info[, c(
    "normalized_r1_ch1",
    "normalized_r2_ch1",
    "normalized_r3_ch1"
  )],
  use = "pairwise.complete.obs"
)
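The replicate check described above (keep a screen only if at least one pair of replicates has a Pearson correlation greater than 0.7) can be expressed as a short Python sketch. The helper names and the example values are hypothetical; missing wells are represented as None and dropped pairwise, which is intended to mirror the "pairwise.complete.obs" option used in the R command.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)

def passes_replicate_qc(replicates, threshold=0.7):
    """True if at least one pair of replicates correlates above the threshold."""
    return any(pearson(a, b) > threshold for a, b in combinations(replicates, 2))

# Invented normalized values for three replicates; None marks a missing well.
rep1 = [-2.0, -0.5, 0.1, 0.4, 1.0]
rep2 = [-1.8, -0.4, None, 0.5, 0.9]
rep3 = [0.3, -1.2, 0.8, -0.1, 0.2]
screen_ok = passes_replicate_qc([rep1, rep2, rep3])
```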
A further quality control step that is recommended is to examine
the Z-prime (Z′) values for each screen [19]. Z′ scores provide a
measure of the separation of the positive and negative control
siRNAs included in a screen and so can be considered an estimate
of how much it is possible for the individual “sample” wells to vary
in Z-scores. Larger values of Z′ indicate better screens. Screens with
Z′ values ≥ 0.5 are considered excellent. Those with Z′ values ≤ 0 are
considered unusable and should be rejected and the experiments
should be repeated. CellHTS2 calculates Z′ scores for each replicate
and these can be found in the HTML report under the “plate
summaries” section.
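For orientation, the Z′ factor of Zhang et al. [19] is defined as Z′ = 1 - 3(SDpos + SDneg) / |meanpos - meanneg|. The Python sketch below computes it from control-well values using the mean and sample standard deviation; whether cellHTS2 uses these or robust estimators for the values shown in its report is not stated here, so treat this as an assumption. The function name and the control values are invented for illustration.

```python
import statistics

def z_prime(pos_controls, neg_controls):
    """Z' = 1 - 3 * (SD_pos + SD_neg) / |mean_pos - mean_neg|."""
    spread = statistics.stdev(pos_controls) + statistics.stdev(neg_controls)
    separation = abs(statistics.mean(pos_controls) - statistics.mean(neg_controls))
    return 1 - 3 * spread / separation

# Invented normalized values for positive (e.g., PLK1) and negative control wells.
pos = [-9.0, -10.0, -11.0]   # mean -10, SD 1
neg = [-0.5, 0.0, 0.5]       # mean 0, SD 0.5
quality = z_prime(pos, neg)  # 1 - 3 * 1.5 / 10 = 0.55
```

With these toy values the controls are well separated relative to their spread, so the screen would fall into the "excellent" category.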
3.2 Identification of Kinase Dependencies Associated with Driver Gene Mutation or Copy Number Alteration
We next integrate the processed results from multiple siRNA
screens with data describing the genetic alterations present in each
sample. For this tutorial we use the siRNA data from 18
osteosarcoma tumor cell lines and a mutations file that describes the
presence or absence of genetic alterations in different members of the
Retinoblastoma (RB1) pathway. In the git repository downloaded,
there is a set of directories containing pre-formatted siRNA and
mutation datasets as well as R scripts to process the data. Open the
script R-scripts/identifying_CGDs_RB1_osteosarcoma.R and
examine its contents. The first command sets the working directory
to the top level of the git repository we cloned/downloaded earlier.
Modification of the path given to the setwd() function is required
to point to the appropriate location on your local system.
setwd("~/software/identifying-genetic-dependencies")
The next command runs R code contained in a second file in
the R-scripts directory. The dot at the beginning of the path
indicates that the path is relative to the current working directory.
The file “identifying_CGDs_library.R” contains a set of functions
that abstract the process of loading mutation and siRNA datasets as
well as running a set of statistical tests. Readers familiar with R can
examine the code in this file to understand the individual analysis
steps in more detail.
source("./R-scripts/identifying_CGDs_library.R")
We next define the paths to the siRNA and mutation data files
used in the analysis. It is helpful to define this kind of information
near the top of scripts so that in the future the files can be changed
without having to find the commands where these values are used.
sirna_screens_file <- "./siRNA-data/Osteosarcoma_kinome_screens.txt"
rb_pathway_func_muts_file <- "./mutation-data/combined_exome_cnv_func_muts_RBpathway_160418.txt"
rb_pathway_all_muts_file <- "./mutation-data/combined_exome_cnv_all_muts_RBpathway_160418.txt"
The next command reads the siRNA and mutation datasets,
identifies cell lines in common between the datasets, and returns an
R list object containing analysis-ready tables. The input files
comprise tab-separated data where the first row and first column
represent column and row names, respectively. Aside from the first row
(column headings), each row contains data for a single cell line.
Each column represents a property measured across the cell lines.
In the “sirna_screens_file,” these properties are the Z-scores
representing the relative viability of cells treated with siRNAs targeting
specific genes. In the case of the mutation datasets
(rb_pathway_func_muts_file and rb_pathway_all_muts_file), these properties
represent the presence or absence of a driver gene alteration. The
file rb_pathway_func_muts_file contains a “1” where a cell line is
considered to contain a likely functional cancer driver gene
alteration (mutation or copy number alteration) and a “0” where such a
change is absent. Similarly, the file rb_pathway_all_muts_file
contains a “1” or “0” to indicate the presence or absence of any driver
gene alteration found in a cell line, irrespective of presumed functional
impact. These two files are used to identify sets of cell lines where a
driver gene is considered to be functionally altered (the mutant
group) or where alterations to the driver gene are entirely absent
(the wild-type group) (see Note 11).
kinome_rb_muts <- read_rnai_mutations(
  rnai_file = sirna_screens_file,
  func_muts_file = rb_pathway_func_muts_file,
  all_muts_file = rb_pathway_all_muts_file
)
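The grouping logic described above can be illustrated with a small Python sketch: a cell line is placed in the mutant group when the functional-alteration table holds a 1, in the wild-type group when the all-alterations table holds a 0, and is otherwise set aside as of uncertain relevance (see Note 11). This is not the code of read_rnai_mutations(); the function name, the dictionary layout, and the cell line names are hypothetical.

```python
def define_groups(screened, func_muts, all_muts, gene):
    """Split screened cell lines into mutant, wild-type, and excluded groups.

    func_muts / all_muts: {cell_line: {gene: 0 or 1}}
    """
    # Only cell lines present in the screen and in both mutation tables are used.
    shared = sorted(set(screened) & set(func_muts) & set(all_muts))
    mutant = [c for c in shared if func_muts[c][gene] == 1]
    wild_type = [c for c in shared if all_muts[c][gene] == 0]
    excluded = [c for c in shared if c not in mutant and c not in wild_type]
    return mutant, wild_type, excluded

# Invented annotations for four hypothetical cell lines.
func = {"line_A": {"RB1": 1}, "line_B": {"RB1": 0},
        "line_C": {"RB1": 0}, "line_D": {"RB1": 1}}
every = {"line_A": {"RB1": 1}, "line_B": {"RB1": 1},
         "line_C": {"RB1": 0}, "line_D": {"RB1": 1}}
screened = {"line_A", "line_B", "line_C"}  # line_D was not screened
mutant, wild_type, excluded = define_groups(screened, func, every, "RB1")
```

Here line_B carries an alteration of uncertain relevance, so it belongs to neither group and would not contribute to the association test.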
With the siRNA and mutation data tables organized within
kinome_rb_muts, we now test for associations between mutations
or copy number alterations in RB1 pathway genes and
dependency on each gene targeted in the kinome siRNA library. The
function run_univariate_tests() performs Wilcoxon Rank Sum
tests between siRNA Z-scores of cell lines in the mutant and
wild-type groups and returns a table of these test results as well as other
information such as descriptive statistics (including the median
Z-score of the mutant and wild-type groups and the difference
between those two values).
kinome_rb_mut_associations <- run_univariate_tests(
  zscores = kinome_rb_muts$rnai,
  mutations = kinome_rb_muts$func_muts,
  all_variants = kinome_rb_muts$all_muts,
  alt = "less"
)
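To make the one-sided test concrete, the following Python sketch computes an exact Wilcoxon rank-sum p-value for the alternative that Z-scores in the mutant group are lower than in the wild-type group, by enumerating every possible assignment of ranks to the mutant group. It is an illustration only, feasible for small panels, and is not the implementation inside run_univariate_tests(); when ties are present R's wilcox.test() switches to a normal approximation, so p-values may differ slightly. The function names and Z-scores are invented.

```python
from itertools import combinations

def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def rank_sum_p_less(mutant, wild_type):
    """Exact one-sided p-value that mutant scores tend to be lower."""
    ranks = midranks(list(mutant) + list(wild_type))
    observed = sum(ranks[:len(mutant)])
    sums = [sum(c) for c in combinations(ranks, len(mutant))]
    # Small tolerance guards against floating-point noise in the comparison.
    return sum(s <= observed + 1e-12 for s in sums) / len(sums)

# Invented Z-scores for one siRNA target in two groups of cell lines.
mutant_z = [-3.1, -2.4, -2.0]
wild_type_z = [-0.5, 0.2, 0.4, 1.1, -1.0]
p_value = rank_sum_p_less(mutant_z, wild_type_z)  # 1 / 56
```

Because all three mutant scores are lower than every wild-type score, only one of the 56 possible rank assignments is as extreme as the observed one.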
We write out the results of the association tests to a text file that
can be opened in a spreadsheet application or used as input for
other programs such as the annotate_dependencies.py Python
program described in Subheading 3.3.
write.table(
  kinome_rb_mut_associations,
  "./results/kinome_rb_mut_associations.txt",
  sep = "\t",
  col.names = TRUE,
  row.names = FALSE,
  quote = FALSE
)
3.3 Annotating Molecular Dependencies According to Known Functional Relationships
In the absence of additional information, interpreting an
association between the mutation of a driver gene and sensitivity to RNAi
reagents targeting another gene can be difficult. One approach to
aiding the interpretation of these associations is the integration of
orthogonal data, including known functional relationships between
genes or their protein products. We provide a simple Python script
(annotate_dependencies.py) that can be used to integrate known
functional relationships (e.g., protein-protein, kinase-substrate, or
gene-regulatory interactions) with the associations generated by
the R scripts described in Subheading 3.2. This script adds an
additional column to the associations file indicating whether or
not the marker-target gene pair has a known functional relationship
according to a user-supplied source of interactions.
1. Create a file containing functional relationships between genes
(see Note 6 for potential sources of these relationships). Each
line of this file should contain two gene symbols (HUGO gene
names) separated by a tab. Alternatively, files in the BioGRID
Tab 2.0 Format, such as those downloaded from the BioGRID
database [20], can be used as input.
2. Open a command prompt/terminal and run the script as
follows:
python annotate_dependencies.py -a <associations> -o <output> -i <interactions> -n <column_name>
where <associations> is the name of the associations file
created using the R scripts above, <output> is the name of the file
where the annotated associations will be written, <interactions>
is the name of the file containing known functional relationships,
and <column_name> is an optional name for the column where
the functional annotation will be stored. If the interactions file is in
the BioGRID Tab 2.0 format then add the optional “-b” argument
to this command. See Note 12 for additional parameters of this script.
3. View the resulting output in a text editor or spreadsheet
application. There should be an additional column in the file
named using the <column_name> argument, with True or
False values indicating whether each marker-target association
involves a gene pair with a known functional relationship
according to the <interactions> file.
4. Additional columns can be added (e.g., to annotate the asso-
ciations according to a different source of interactions) by
running the script again using the output file (<output>) of
the first run as input to a subsequent run. For this step it is
necessary to set the <column_name> parameter to avoid over-
writing previous results.
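The annotation performed in the steps above can be illustrated with a minimal Python sketch. It is not the code of annotate_dependencies.py: the function names are hypothetical, the interactions are given as simple two-column tab-separated lines, and the association records reuse the illustrative values shown in Fig. 1. The directed option corresponds to the behaviour described in Note 12.

```python
def load_interactions(lines, directed=False):
    """Build a set of gene pairs from tab-separated 'GENE_A<TAB>GENE_B' lines."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        a, b = fields[0], fields[1]
        pairs.add((a, b))
        if not directed:
            pairs.add((b, a))  # undirected: (a, b) is the same as (b, a)
    return pairs

def annotate(associations, pairs, column_name="PPI"):
    """Return copies of the association records with a True/False column added."""
    return [dict(row, **{column_name: (row["marker"], row["target"]) in pairs})
            for row in associations]

interaction_lines = ["DYRK1A\tRB1\n"]
associations = [
    {"marker": "RB1", "target": "DYRK1A", "wilcox.p": 0.005},
    {"marker": "CDKN2A", "target": "BRAF", "wilcox.p": 0.600},
]
undirected = annotate(associations, load_interactions(interaction_lines))
directed = annotate(associations,
                    load_interactions(interaction_lines, directed=True))
```

With the undirected network the RB1/DYRK1A association is flagged as True; with the directed network it is not, because the only listed interaction runs from DYRK1A to RB1.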
The end result of this analysis is a file containing an annotated
list of associations between a particular genomic feature (indicated
in the “marker” column) and increased sensitivity to siRNA
reagents targeting a particular gene (indicated in the “target” col-
umn). The column titled “PPI” in this file indicates whether the
marker gene and the target have a known functional relationship
(e.g., protein-protein interaction) while the column “wilcox.p”
gives an indication of the statistical significance of the association.
These p-values, together with the annotation of known functional
relationships, may be used to prioritize candidate genetic
dependencies (synthetic lethalities) for follow-up experiments. At a
minimum, these follow-up experiments should involve using orthogonal
means to test the observed association (e.g., alternative siRNA
reagents or a small molecule targeting the protein product of the
gene of interest) [21]. Ideally, the follow-up validation would test
the association in additional cell lines harboring the mutation of
interest. In the example provided, we found that RB1 mutation is
associated with increased sensitivity to siRNA targeting the kinase
DYRK1A, a known RB1 binding partner. In Campbell et al. [5] we
validated this in a larger panel of osteosarcoma cell lines using four
distinct siRNA reagents targeting the DYRK1A gene, suggesting
that the initial observation represents a real dependency.
4 Notes
1. All the analyses can be performed on a desktop computer. A
recent version of the R statistical programming environment
(available from https://www.r-project.org/) and the Python
programming language (available from
https://www.python.org/) are required. The Python scripts presented here have
been tested with Python versions 2.7 and 3.4, while R scripts
have been tested with version 3.2.5.
2. Note that we provide extensive code samples throughout this
document. In these samples the tilde character (~) is used as a
shortcut to the user’s home directory on Unix-like systems.
On Microsoft Windows, the forward slash characters (/)
separating the file paths will need to be substituted with backslashes
(\).
3. Gplots is provided on the Comprehensive R Archive Network
(CRAN) and can be installed by starting an R session and entering
the following code:
install.packages(
  "gplots",
  dependencies = TRUE
)
4. CellHTS2 [18] is an R package used to process RNAi screen
data and can be installed using the following commands (in
R 3.5 and later, the biocLite() installer has been replaced by
BiocManager::install()):
source("https://bioconductor.org/biocLite.R")
biocLite("cellHTS2")
5. This repository can be downloaded as a zip file by navigating to
the above URL and choosing “download ZIP.” Alternatively,
install git (software available from https://git-scm.com), open
a console window, change directory to a suitable path that must
exist (e.g., cd ~/software), and enter the following command:
git clone https://github.com/GeneFunctionTeam/identifying-genetic-dependencies
This command should create a new directory (e.g.,
~/software/identifying-genetic-dependencies) containing data and
scripts. The data files include a file containing viability data
from an siRNA screen of osteosarcoma cell lines [5] and driver
gene mutation datasets compiled from publicly available
compendia of mutations in tumor cell lines [22].
6. BioGRID is a database of experimentally determined molecular
interactions [20]. The web interface to BioGRID allows users
to download the entire database in Tab 2.0 format, and also the
interactions associated with a specific gene. An alternative
source is PathwayCommons [23], which integrates protein-
protein, gene-regulatory, kinase-substrate, and other molecular
relationships. More specialized data sources include Phospho-
SitePlus [24] (kinase-substrate relationships) and HINT (high-
confidence protein-protein interactions) [25].
7. Detailed instructions on how to use cellHTS2 can be found in
an R vignette titled “End-to-end analysis of cell-based
screens.” Once the cellHTS2 package is installed, the com-
mand ‘browseVignettes("cellHTS2")’ can be entered into the
R console to reveal links to this and other relevant vignettes.
8. In our experience, luminescence values from multiwell siRNA
screens tend to be positively skewed and show a log-normal
distribution. It is thus preferable to log transform values prior
to normalization. Setting the “log” argument of the “normal-
izePlates” function to “TRUE” and the “scale” argument to
“multiplicative” instructs cellHTS2 to first log transform the
luminescence values and then subtract the plate median values
from each value on a plate.
9. Z-normalization in the classical sense refers to adjusting a set of
normally distributed values such that they have a mean value of
zero and a standard deviation equal to one. For idealized
normally distributed Z-scores, approximately 95% of the values
are expected to fall between Z = −2 and Z = +2 and 99.7% of
the values are expected to fall between Z = −3 and Z = +3.
Log-transformed and plate-centered luminescence values from
siRNA screens often have negatively skewed distributions that are
not well described by statistics such as the mean and standard
deviation. As an alternative to standard Z-score normalization we use
robust Z-normalization where the median value is subtracted
from all log-transformed plate-centered values and these values
are then divided by the median absolute deviation (MAD) of
the distribution. This results in approximately 95% of the values
falling between Z = −2 and Z = +2. Thus, siRNAs that
produce a Z-score of < −2 (or more stringently, < −3) are
interpreted as causing a decrease in viability.
10. At least two replicates are required for each screen in order to
assess the overall reproducibility of the screen. We typically
perform screens using three replicates and take the median
value for each siRNA to further minimize noise.
11. Defining functional mutations in cancer driver genes can be
difficult. In some cases (e.g., amplification of a gene such as
ERBB2) the functional relevance of an alteration is well
established. In many cases, however, especially those involving
missense mutations, the functional relevance of an alteration is
uncertain. We previously developed a simple pipeline to classify
mutations and copy number changes as either of likely
functional relevance or of uncertain relevance [5]. For tumor
suppressor genes we classify homozygous deletions, muta-
tions predicted to cause a truncation (frame shift, nonsense,
or splice site alteration) or missense mutations found to occur
recurrently in tumors as functionally relevant. For oncogenes,
we classify amplification events or recurrent missense muta-
tions as functionally relevant. Mutations other than these are
classified as of uncertain relevance and cell lines harboring
these mutations are excluded from our association tests.
12. By default the “annotate_dependencies.py” script assumes
that the interactions provided in the input file are undirected
(i.e., the interaction (a, b) is the same as the interaction (b, a)).
Using the argument “-d” changes this default behavior such
that a directed network is utilized. This may be more appro-
priate for directed networks—e.g., for RB1 associated depen-
dencies it may make sense to highlight associations between
RB1 and genes that it regulates, but not associations involving
genes that regulate RB1.
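The coverage figures quoted in Note 9 can be checked with Python's standard library, as in the sketch below. It also shows where the constant 1.4826, commonly used to make the MAD comparable to a standard deviation for normally distributed data, comes from; whether and how this constant is applied inside cellHTS2 is not stated in this chapter, so that part is offered only as background. The variable and function names are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(limit):
    """Fraction of a standard normal distribution lying within +/- limit."""
    return standard_normal.cdf(limit) - standard_normal.cdf(-limit)

within_2 = coverage(2)   # about 0.954
within_3 = coverage(3)   # about 0.997

# For normal data, MAD = SD * (75th percentile of the standard normal),
# so multiplying the MAD by the reciprocal recovers the SD.
mad_scale = 1 / standard_normal.inv_cdf(0.75)  # about 1.4826
```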
References
1. Davoli T et al (2013) Cumulative haploinsufficiency and triplosensitivity drive aneuploidy patterns and shape the cancer genome. Cell 155(4):948–962
2. Lawrence MS et al (2014) Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505(7484):495–501
3. Yaffe MB (2013) The scientific drunk and the lamppost: massive sequencing efforts in cancer discovery and treatment. Sci Signal 6(269):e13
4. Brough R et al (2011) Functional viability profiles of breast cancer. Cancer Discov 1(3):260–273
5. Campbell J et al (2016) Large-scale profiling of kinase dependencies in cancer cell lines. Cell Rep 14(10):2490–2501
6. Cheung HW et al (2011) Systematic investigation of genetic vulnerabilities across cancer cell lines reveals lineage-specific dependencies in ovarian cancer. Proc Natl Acad Sci U S A 108(30):12372–12377
7. Cowley GS et al (2014) Parallel genome-scale loss of function screens in 216 cancer cell lines for the identification of context-specific genetic dependencies. Sci Data 1:140035
8. Hart T et al (2015) High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell 163(6):1515–1526
9. Kim HS et al (2013) Systematic identification of molecular subtype-selective vulnerabilities in non-small-cell lung cancer. Cell 155(3):552–566
10. Marcotte R et al (2016) Functional genomic landscape of human breast cancer drivers, vulnerabilities, and resistance. Cell 164(1–2):293–309
11. Moser R et al (2014) Functional kinomics identifies candidate therapeutic targets in head and neck cancer. Clin Cancer Res 20(16):4274–4288
12. Luo J, Solimini NL, Elledge SJ (2009) Principles of cancer therapy: oncogene and non-oncogene addiction. Cell 136(5):823–837
13. Lord CJ, Tutt AN, Ashworth A (2015) Synthetic lethality and cancer therapy: lessons learned from the development of PARP inhibitors. Annu Rev Med 66:455–470
14. Helming KC et al (2014) ARID1B is a specific vulnerability in ARID1A-mutant cancers. Nat Med 20(3):251–254
15. Hsu TY et al (2015) The spliceosome is a therapeutic vulnerability in MYC-driven cancer. Nature 525(7569):384–388
16. Kelley R, Ideker T (2005) Systematic interpretation of genetic interactions using protein networks. Nat Biotechnol 23(5):561–566
17. Lord CJ et al (2008) A high-throughput RNA interference screen for DNA repair determinants of PARP inhibitor sensitivity. DNA Repair (Amst) 7(12):2010–2019
18. Boutros M, Bras LP, Huber W (2006) Analysis of cell-based RNAi screens. Genome Biol 7(7):R66
19. Zhang JH, Chung TD, Oldenburg KR (1999) A simple statistical parameter for use in evaluation and validation of high throughput screening assays. J Biomol Screen 4(2):67–73
20. Chatr-Aryamontri A et al (2015) The BioGRID interaction database: 2015 update. Nucleic Acids Res 43(Database issue):D470–D478
21. Jackson AL, Linsley PS (2010) Recognizing and avoiding siRNA off-target effects for target identification and therapeutic application. Nat Rev Drug Discov 9(1):57–67
22. Forbes SA et al (2015) COSMIC: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Res 43(Database issue):D805–D811
23. Cerami EG et al (2011) Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res 39(Database issue):D685–D690
24. Hornbeck PV et al (2015) PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res 43(Database issue):D512–D520
25. Das J, Yu H (2012) HINT: high-quality protein interactomes and their applications in understanding human disease. BMC Syst Biol 6:92
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 Interna-
tional License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation,
distribution and reproduction in any medium or format, as long as you give appropriate credit to the
original author(s) and the source, provide a link to the Creative Commons license and indicate if changes
were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons
license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s
Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
Part II
Systems Analyses of Signaling Networks in Cancer Cells

Chapter 6
Phosphoproteomics-Based Profiling of Kinase Activities in Cancer Cells
Jakob Wirbel, Pedro Cutillas, and Julio Saez-Rodriguez
Abstract
Cellular signaling, predominantly mediated by phosphorylation through protein kinases, is found to be
deregulated in most cancers. Accordingly, protein kinases have been subject to intense investigations in
cancer research, to understand their role in oncogenesis and to discover new therapeutic targets. Despite
great advances, an understanding of kinase dysfunction in cancer is far from complete.
A powerful tool to investigate phosphorylation is mass-spectrometry (MS)-based phosphoproteomics,
which enables the identification of thousands of phosphorylated peptides in a single experiment. Since every
phosphorylation event results from the activity of a protein kinase, high-coverage phosphoproteomics data
should indirectly contain comprehensive information about the activity of protein kinases.
In this chapter, we discuss the use of computational methods to predict kinase activity scores from
MS-based phosphoproteomics data. We start with a short explanation of the fundamental features of the
phosphoproteomics data acquisition process from the perspective of the computational analysis. Next, we
briefly review the existing databases with experimentally verified kinase-substrate relationships and present a
set of bioinformatic tools to discover novel kinase targets. We then introduce different methods to infer
kinase activities from phosphoproteomics data and these kinase-substrate relationships. We illustrate their
application with a detailed protocol of one of the methods, KSEA (Kinase Substrate Enrichment Analysis).
This method is implemented in Python within the framework of the open-source Kinase Activity Toolbox
(kinact), which is freely available at http://github.com/saezlab/kinact/.
Key words Phosphoproteomics, Mass spectrometry, Kinase activity, Computational biology, Cancer
systems biology, Signal transduction
1 Introduction
Protein kinases are major effectors of cellular signaling, in the
context of which they form a highly complex and tightly regulated
network that can sense and integrate a multitude of external stimuli
or internal cues. This kinase network exerts control over cellular
processes of fundamental importance, such as the decision between
proliferation and apoptosis [1]. Deregulation of kinase signaling
can lead to severe diseases and is observed in almost every type of
cancer [2]. For instance, a single constitutively active kinase,
https://doi.org/10.1007/978-1-4939-7493-1_6, © The Author(s) 2018
originating from the fusion of the BCR and ABL genes, can give
rise to and sustain chronic myeloid leukemia [3]. Accordingly, the
small molecule inhibitor of the BCR-ABL kinase, Imatinib, has
shown unprecedented therapeutic effectiveness in affected
patients [4].
Fueled by these promising clinical results, by the essential
role of kinases in the pathomechanism of cancer, and by the fact that
kinases are in general pharmacologically tractable [5], a range of
new kinase inhibitors has been approved or is in development for
different cancer types [6]. However, not all eligible patients
respond equally well, and in addition, cancers often develop resis-
tance to initially successful therapies. This calls for a deeper under-
standing of kinase signaling and opens up the possibility of
exploiting this knowledge therapeutically [7].
By definition, the activity of a kinase is reflected in the occur-
rence of phosphorylation events catalyzed by this kinase. Thus,
analysis of kinase activity was traditionally achieved by monitoring
the phosphorylation status of a limited number of sites known to be
targeted by the kinase of interest using immunochemical techniques
[8]. This, however, requires substantial prior knowledge and
yields comparatively low throughput. Other approaches exist, e.g.,
protein kinase activity assays [9, 10] or attempts to measure kinase
activity with chromatographic beads functionalized with ATP or
small molecule inhibitors [11].
Mass spectrometry-based techniques to measure phosphoryla-
tion can identify thousands of phosphopeptides in a single sample
with ever-increasing coverage, throughput, and quality, nourished
by technological advances and dramatically increased performance
of MS instruments in recent years [12–14]. High-coverage phos-
phoproteomics data should indirectly contain information about
the activity of many active kinases. The high-content nature of
phosphoproteomics data, however, poses challenges for computa-
tional analysis. For example, only a small subset of the described
phosphorylation sites can be explicitly associated with functional
impact [15].
As a means to extract functional insight, methods to infer
kinase activities from phosphoproteomics data based on prior
knowledge about kinase-substrate relationships have been put forward
[16–19]. The knowledge about kinase-substrate relationships,
compiled in databases like PhosphoSitePlus [20] or
Phospho.ELM [21], covers only a limited set of interactions. Alter-
natively, computational resources to predict kinase-substrate rela-
tionships based on kinase recognition motifs and contextual
information have been used to enrich the collections of substrates
per kinase [22, 23], but the accuracy of such kinase-substrate
relationships has not been validated experimentally for most cases.
The inferred kinase activities can in turn be used to reconstruct
kinase network circuitry or to predict therapeutically relevant features
such as sensitivity to kinase inhibitor drugs [17].
In this chapter, we start with a brief description of phosphopro-
teomics data acquisition, highlighting challenges for the computa-
tional analysis that may arise out of the experimental process.
Subsequently, we will present different computational methods for
the estimation of kinase activities based on phosphoproteomics data,
preceded by the kinase-substrate resources these methods use. One
of these methods, namely KSEA (Kinase-Substrate Enrichment
Analysis), will be explained in more detail in the form of a guided,
stepwise protocol, which is available as part of the open-source
Python toolbox kinact (for Kinase Activity Scoring) at
http://www.github.com/saezlab/kinact/.
2 Phosphoproteomics Data Acquisition
For a summary of technical variations or available systems for the
experimental setup of phosphoproteomics data acquisition, we
would like to refer the interested reader to dedicated publications
such as [24, 25]. We provide here a short overview about the
experimental process to facilitate the understanding of common
challenges that may arise for the data analysis that we will focus on.
Mass spectrometry-based detection of peptides with posttrans-
lational modifications (PTM) usually requires the same steps, inde-
pendent of the modification of interest: (1) cell lysis and protein
extraction with special focus on PTM preservation, (2) digestion of
proteins with an appropriate protease, (3) enrichment of peptides
bearing the modification of interest, and (4) analysis of the peptides
by LC-MS/MS [26]. After the experimental work, additional data
processing steps are required to identify the position of the modifi-
cation, e.g., the residue that is phosphorylated. For almost every
step, different protocols are available, starting from various pro-
teases for protein digestion to different data acquisition methods
for MS [24].
2.1 Phosphopeptide Enrichment
Naturally, the enrichment of phosphopeptides is a pivotal step for
phosphoproteomics. Next to the enrichment method used, the
choice of the protease [27] or the MS ionization method [28] also
has an impact on the part of the phosphoproteome that is sampled.
For phosphopeptide enrichment, the field is dominated by immobilized
metal affinity chromatography (IMAC) and metal oxide affinity
chromatography (MOAC), both of which exploit the affinity of the
phosphate group toward metal ions. Popular techniques include
Fe³⁺-IMAC, Ti⁴⁺-IMAC [29], and TiO₂-MOAC. Alternatively, more traditional
biochemical methods involving immunoaffinity purification
are also in use for enrichment of phosphopeptides, although these
are generally limited to studies of phosphotyrosine [30].
Of note, the different enrichment methods show little overlap
in the detected phosphopeptides, although this can also be
observed for replicates of runs using the identical enrichment
method, as discussed below [31].
After enrichment, the phosphopeptides are separated chro-
matographically, usually by reversed phase liquid chromatography
(RPLC), and then enter the mass spectrometer for tandem MS
analysis (MS/MS), completing the workflow of LC-MS/MS. Var-
iations in the chromatography method used as well as the multitude
of mass spectrometry instrument types are reviewed in detail else-
where [24]. Generally, the quality of the chromatographic separa-
tion will have a big impact on the number of phosphopeptides that
can confidently be identified. Chromatography runs of higher
quality also take more time, so that a tradeoff between resolution
and throughput must be devised for each experiment.
2.2 Data Acquisition
For most phosphoproteomics studies so far, the MS instrument is
operated in the data-dependent acquisition (DDA) mode. Therein,
precursor ions from a first survey scan are selected—typically based
on relative ion abundance—in order to generate fragmentation
spectra in a second MS run [32], for which a database search yields
the corresponding peptide sequences [33]. As a result, peptide
detection in DDA is biased toward high-abundance species and is
also poorly reproducible because precursor ion selection is
stochastic [34]. This inherent under-sampling of
DDA usually leads to missing data points in LC-MS/MS datasets.
However, this problem may be solved to some extent by extracting
ion chromatograms of the peptides that are missing in some of the
runs that are being compared [35–38], by matching across samples
[39], or with the accurate mass and retention tag method [40].
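The abundance bias of DDA and the resulting missing values can be illustrated with a small sketch. It assumes a simple top-N rule (the N most intense precursors of a survey scan are fragmented); the function name, the peptide labels, and the intensities are hypothetical and only serve the illustration.

```python
def select_precursors(survey_scan, top_n):
    """Return the top_n precursor ions of one survey scan, ranked by intensity."""
    ranked = sorted(survey_scan.items(), key=lambda item: item[1], reverse=True)
    return [ion for ion, _intensity in ranked[:top_n]]


# Two replicate survey scans of the same sample with slightly different
# intensities (made-up numbers).
replicate_1 = {"pepA": 9.0e6, "pepB": 4.0e6, "pepC": 3.1e5, "pepD": 3.0e5}
replicate_2 = {"pepA": 8.5e6, "pepB": 4.2e6, "pepC": 2.9e5, "pepD": 3.3e5}

selected_1 = select_precursors(replicate_1, top_n=3)
selected_2 = select_precursors(replicate_2, top_n=3)

# Peptides fragmented in replicate 1 but not in replicate 2 become
# missing values when the two runs are compared.
missing_in_2 = set(selected_1) - set(selected_2)
```

In this toy case the two abundant peptides are selected in both runs, while the low-abundance peptides swap places around the cutoff, so each run misses a peptide that the other one reports.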
In an alternative operation mode, selected reaction monitor-
ing/multiple reaction monitoring (SRM/MRM), the presence and
abundance of only a limited set of pre-specified peptides with
known fragmentation spectra is surveyed [41]. This targeted
approach overcomes many of the issues of shotgun methods, but
is usually not feasible for large-scale investigation of the complete
phosphoproteome.
Data-independent acquisition (DIA), e.g., SWATH-MS [42],
tries to address the shortcomings of both established data acquisition
strategies in order to combine the throughput of DDA with
the reproducibility of SRM. In DIA, fragmentation spectra are
generated for all precursor ions in a specific window of m/z ratios,
leading to a complete map of fragmentation spectra, followed by
computational extraction of quantitative information for known
spectra. For phosphoproteomics, DIA-MS has already been applied
to investigate insulin signaling [43] or histone modifications
[44]. However, the spectra generated by DIA-MS are usually
highly complex and require intricate data extraction techniques,
which is even more challenging for modified peptides. Recently, a
computational resource for the detection of modified peptides has
been put forward [45]. Overall, the available methods for DIA have
yet to mature in order to challenge the use of DDA in large-scale
studies of the phosphoproteome [24].
2.3 Quantitative Phosphoproteomics
As for regular proteomics, several experimental methods or post-acquisition
tools exist to quantitate detected phosphopeptides.
Those can roughly be divided into isotope labeling and label-free
quantitation. In general, stable isotope labeling requires more
experimental effort than label-free quantitation, but at the same
time enables multiplexing of samples with different isotopes or
combinations.
Stable isotope labeling by amino acids in cell culture (SILAC) is
mainly used for cell cultures, in the medium of which amino acids
carrying different stable isotopes are provided that will be
metabolically incorporated into the proteins of the cells. At the point of analysis,
cell extracts are mixed and then jointly investigated with mass
spectrometry. Mass differences between peptide pairs due to the
isotopic labeling can be exploited for relative quantitation
[46]. Currently, up to three conditions (light, medium, heavy)
can be multiplexed. Further developments of SILAC even pro-
duced an in-vivo SILAC mouse model for the proteomic and
phosphoproteomic analysis of skin cancerogenesis [47] and super-
SILAC for the analysis of tissues [48], in which a metabolically
labeled, tissue-specific protein mix from several cell lines, represent-
ing the complexity of the investigated proteome, is mixed with the
tissue lysate as internal standard for quantification.
Chemical modification of peptides with tandem mass tags
(TMT) or isobaric tags for relative and absolute quantitation
(iTRAQ) are two different methods based on tags with reactive
groups that bind to peptidyl amines in the peptides after protein
digestion. Again, different samples are mixed before mass spectrometry
analysis; with TMT or iTRAQ, up to eight samples
can be multiplexed. In the first MS run, the peptides with different
isobaric tags are indistinguishable, but upon fragmentation in the
second MS run, each tag generates a unique reporter ion fragmen-
tation spectrum, which can be used for relative quantitation of the
tagged peptides [49, 50].
Label-free quantitation (LFQ), on the other hand, relies mainly
on post-acquisition data analysis, so that no modification of the
essential experimental workflow needs to be implemented. In theory,
an unlimited number of different samples can therefore be compared,
at the cost of prolonged analysis time, since samples cannot be
multiplexed. While
label-free approaches usually provide a deeper coverage of the
proteome than label-based methods, the reproducibility and preci-
sion of quantification are inferior, so that more technical replicates
are needed for confident quantification in LFQ [51]. Typically,
label-free quantitation is achieved by integration of peak area measurements,
i.e., the area under the curve, for individual peptides
[52] or by spectral counting, which exploits the fact that more
abundant peptides are more likely to be sampled [53].
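As a rough illustration of peak-area integration, the following sketch applies the trapezoidal rule to an extracted ion chromatogram. It assumes the chromatogram is given as paired lists of retention times and intensities; the function name and the numbers are hypothetical, and real quantitation software additionally handles peak detection, baseline correction, and retention time alignment.

```python
def peak_area(retention_times, intensities):
    """Area under a chromatographic peak by the trapezoidal rule."""
    if len(retention_times) != len(intensities):
        raise ValueError("retention_times and intensities must have equal length")
    area = 0.0
    for i in range(1, len(retention_times)):
        width = retention_times[i] - retention_times[i - 1]
        area += 0.5 * width * (intensities[i] + intensities[i - 1])
    return area


# Toy chromatogram of one phosphopeptide (made-up values).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
signal = [0.0, 2.0, 4.0, 2.0, 0.0]
area = peak_area(times, signal)
```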
For the case of phosphoproteomics, in contrast to regular
proteomics, an additional challenge for quantitation arises from
the fact that information from different peptides of the same pro-
tein cannot be integrated. While in regular proteomics the abun-
dances of every peptide in the protein can be combined, the
quantitation of a single phosphosite depends on direct measure-
ments of peptides with the specific modification. Therefore, the
sample sizes in phosphoproteomics quantitation are much smaller
and can even consist of the measurement of only a single
peptide [24].
Furthermore, different phosphosites within the same protein
will in many cases not show similar patterns of phosphorylation
dynamics. This may give rise to problems for subsequent analysis,
if this analysis is conducted on protein rather than on phosphosite
level.
2.4 Phosphosite Assignment
Phosphopeptides in large-scale phosphoproteomics experiments
are identified from LC-MS/MS runs by interpreting MS/MS spectra
using a suitable search engine. Several such search engines
now exist; popular ones include Mascot, Sequest, Protein Prospec-
tor, and Andromeda [54–57]. The process of determining peptide
sequences from MS/MS data involves matching the mass to charge
ratios of fragment ions in MS/MS spectra to the theoretical frag-
mentation of all protein-derived peptides in protein databases.
Depending on the organism being investigated, protein databases
from UniProt or NCBI are used. Each search engine has its own
scoring system to reflect the confidence of peptide identification,
which is a function of MS and MS/MS spectral quality. The false
discovery rate (FDR) may be determined by performing parallel
searches against scrambled or reversed protein databases containing
the same number of sequences as the authentic protein database.
The FDR is then calculated as the number of positive peptide identifications
in the decoy database divided by the number derived from the
forward search. An FDR of 1% at the peptide level is normally
considered adequate.
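The target-decoy estimate described above can be sketched as follows. The sketch assumes that every identification carries a single score for which higher means more confident; the function names and the toy scores are hypothetical and not tied to any particular search engine.

```python
def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR at a score threshold as decoy hits / target hits."""
    n_target = sum(score >= threshold for score in target_scores)
    n_decoy = sum(score >= threshold for score in decoy_scores)
    if n_target == 0:
        return 0.0
    return n_decoy / n_target


def lowest_threshold_at_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest observed target score whose estimated FDR is <= max_fdr."""
    best = None
    for threshold in sorted(set(target_scores), reverse=True):
        if decoy_fdr(target_scores, decoy_scores, threshold) <= max_fdr:
            best = threshold
    return best


# Made-up scores from a forward (target) and a reversed (decoy) search.
targets = [50, 45, 40, 35, 30, 25, 20, 15, 10, 5]
decoys = [12, 8, 3]

fdr_at_10 = decoy_fdr(targets, decoys, threshold=10)
cutoff = lowest_threshold_at_fdr(targets, decoys, max_fdr=0.05)
```

With these toy values, one decoy and nine target hits pass a threshold of 10, so the estimated FDR is 1/9; tightening the threshold to 15 removes the decoy hit.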
Deriving peptide sequences with these methods is a relatively
straightforward process. However, site localization can be a prob-
lem when peptide sequences contain more than one amino acid
residue that can be phosphorylated. To address this problem, sev-
eral methods to determine precise localization of phosphorylation
within a phosphopeptide have been published. Ascore uses a prob-
abilistic approach to assess correct site assignment [58] and the
algorithm has been applied alongside the Sequest search engine.
The Mascot delta score, introduced by the Kuster group, simply
determines the differences in Mascot scores between the different
possibilities for phosphosite localization within a phosphopeptide
[59]. The larger the delta score, the greater the probability of
correct site assignment. Other similar methods have been published
[60] and some of them are now incorporated into search engines
[61]. The output of the phosphopeptide identification step gener-
ally contains scores for both the probability of correct peptide
sequence identification and phosphosite localization.
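The idea behind a delta score can be sketched in a few lines. The sketch assumes that the search engine reports one score per candidate phosphosite position of the same peptide and takes the difference between the best and the second-best candidate; the function name, the site labels, and the scores are hypothetical, and treating a single candidate as compared against zero is an assumption of this sketch.

```python
def delta_score(candidate_scores):
    """Return (best site, score difference to the runner-up).

    candidate_scores maps each candidate phosphosite position of one
    peptide to its search engine score (higher is better).
    """
    ranked = sorted(candidate_scores.items(), key=lambda item: item[1], reverse=True)
    best_site, best_score = ranked[0]
    # Assumption: with only one candidate, compare against a score of zero.
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_site, best_score - runner_up


# Made-up scores for three possible phosphoacceptor residues in one peptide.
site, delta = delta_score({"S3": 62.0, "T5": 41.5, "Y9": 20.0})
```

A large difference indicates that one localization explains the spectrum clearly better than the alternatives, whereas a difference close to zero marks an ambiguous assignment.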
2.5 Pitfalls in the Analysis of MS-Based Phosphoproteomics Data
Although the available experimental methods for MS-based phosphoproteomics
data acquisition have evolved considerably over the
last years, leading to a steadily increasing number of detected
phosphosites, several limitations remain for the investigation of
signaling processes using phosphoproteomics data.
While it has been estimated that there are around 500,000
phosphorylation sites in the human proteome [62], the number of
phosphosites that can be identified in a single MS experiment usually
ranges from around 10,000 up to 40,000 [63]. Therefore, the sampled
phosphoproteomic picture is incomplete. It has to be taken into
account, though, that not all possible phosphorylation sites are
expected to be modified at the same time point. This is caused by
context-dependent regulation of phosphosites. For example, some
phosphosites are controlled differentially at different cell cycle stages,
while others only change under specific external stimulation such as
growth factors or other effector molecules [64, 65]. The hope is
therefore that a significantly larger portion of phosphosites could be
mapped with improving technology and by increasing the diversity
of biologically relevant conditions analyzed. So far though, in differ-
ent MS runs or replicates, usually a distinct set of phosphosites is
detected, as the selection of precursor ions is stochastic. This leads to
incomplete datasets with a high number of missing data points,
challenging computational investigation of the data such as cluster-
ing or correlation analysis. However, as discussed above, approaches
in which phosphopeptide intensities are compared across MS runs
post-acquisition minimize this problem [38].
The functional impact of a phosphorylation event is known only
in the minority of cases [15]. Indeed, it has been hypothesized that a
substantial fraction of phosphorylation sites are non-functional [66],
since phosphorylation sites tend to be poorly conserved throughout
species [67]. Although approaches to studying the function of indi-
vidual phosphorylation events have been proposed [68], it may be
that a large part of the detected phosphosites serves no function at
all. Thus, non-functional sites add a substantial amount of noise to
phosphoproteomics data and complicate the computational analysis.
The inference of kinase activity from phosphoproteomics data
that will be described in the next section aims to overcome these
limitations by integrating the information from many
phosphosites, along with prior knowledge on kinase-substrate relationships,
into a single measure for the kinase activity. It is important
though to keep in mind that any bias in the experimental workflow
will affect these scores. In particular, since highly abundant precursor
ions are more likely to be selected for fragmentation and therefore
detection, targets of upregulated kinases are more likely to be
detected. Therefore, highly active kinases will be preferentially
detected, although downregulated kinases may be identified when
comparing different conditions.
3 Computational Methods for Inference of Kinase Activity
Traditionally, biochemical methods have been the common way to study
kinase activities in vitro and are still broadly used [69, 70]. However,
those methods are generally limited in throughput
and time-consuming. In addition, in vitro methods might
not accurately reflect the in vivo activities of kinases in a specific
cellular context. MS-based methods have also been applied for
assaying kinase activity [9, 10]. Here, the abundances of known
target phosphosites are monitored by MS after an in vitro enzy-
matic reaction.
Since every phosphorylation event results—by definition—
from the activity of a kinase, phosphoproteomics data should be
suitable to infer the activity of many kinases from a comparably low
experimental effort. This task requires computational analysis of the
detected phosphorylation sites (phosphosites), since thousands of
phosphosites can routinely be measured in a single experiment.
Several methods have been proposed in recent years, all of which
utilize prior knowledge about kinase-substrate interactions, either
from curated databases or from information about kinase recogni-
tion motifs.
3.1 Resources for Kinase-Substrate Relationships
As the large-scale detection of phosphorylation events using mass
spectrometry became routine, many freely available databases that
collect experimentally verified phosphosites have been set up,
including PhosphoSitePlus [20], Phospho.ELM [21], Signor
[71], or PHOSIDA [72], to name just a few. The databases differ
in size and aim; PHOSIDA for example provides a tool for the
prediction of putative phosphorylation sites and recently also added
acetylation and other posttranslational modification sites to its
scope. Phospho.ELM computes a score for the conservation of a
phosphosite. Signor is focused on interactions between proteins
participating in signal transduction. PhosphoNetworks [73] is ded-
icated to kinase-substrate interactions, but the information is on
the level of proteins, not phosphosites. Arguably the most prominent
database for expert-edited and curated interactions between
kinases and individual phosphosites (that have not been derived
from in vitro studies) is PhosphoSitePlus, currently encompassing
16,486 individual kinase-substrate relationships [04-2015]. For
Saccharomyces cerevisiae, the database PhosphoGRID provides
analogous information [74]. Specific information about targets of
phosphatases can be found in DEPOD [75]. Also in the Phospho.ELM
database, phosphosites have been associated with regulating
kinases, although this information is available for only about 10% of
the 37,145 human phosphosites in the database [04-2015].
As it has been estimated that there are between 100,000 [76]
and 500,000 [62] possible phosphosites in the human proteome,
the evident low coverage of the curated databases motivated the
development of computational tools to predict in vivo kinase-
substrate relationships. These methods identify putative new
kinase-substrate relationships based on experimentally derived
kinase recognition motifs, an approach pioneered by Scansite [77],
which uses position-specific scoring matrices (PSSMs) obtained by
positional scanning of peptide libraries [78] or phage display methods
[79]. Another approach, NetPhorest [80], tries to classify phosphorylation
sites according to the relevant kinase family instead of
predicting individual kinase-substrate links. However, the in vitro
specificity of kinases differs significantly from the kinase activity
inside of the cell, biasing the experimentally identified kinase rec-
ognition motifs [81]. The integration of contextual information,
for example co-expression, protein-protein interactions, or subcel-
lular colocalization, markedly improves the accuracy of the predic-
tions [69]. The software packages NetworKIN [82] (recently
extended in the context of the resource KinomeXplorer [22], cor-
recting for biases caused by over-studied proteins) and iGPS [23]
are examples for methods that combine information about kinase
recognition motifs, in vivo phosphorylation sites, and contextual
information, e.g., from the STRING database [83]. Recently,
Wagih et al. presented a method to predict kinase specificity for
kinases without any known phosphorylation sites [84]. The method
assumes that functional interaction partners of kinases
(derived from the STRING database) are more likely to be phosphorylated
by the respective kinase and should therefore contain
an amino acid motif conferring kinase specificity. This motif can then be
uncovered by motif enrichment.
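To make the notion of a position-specific scoring matrix more concrete, the sketch below scores a peptide window against a toy PSSM. It assumes an odd-length window centered on the phosphoacceptor residue and a matrix stored as a dictionary keyed by (relative position, amino acid); the function name and the matrix values are invented for illustration and do not come from Scansite or any other resource.

```python
def pssm_score(peptide, pssm):
    """Sum the position-specific scores of a peptide window.

    peptide: odd-length sequence centered on the phosphoacceptor residue.
    pssm: dict mapping (relative position, amino acid) to a score;
          combinations that are not listed contribute 0.
    """
    if len(peptide) % 2 == 0:
        raise ValueError("peptide window must have odd length")
    center = len(peptide) // 2
    return sum(pssm.get((i - center, aa), 0.0) for i, aa in enumerate(peptide))


# Hypothetical matrix: favors arginine at positions -3 and -2,
# penalizes proline at position +1.
toy_pssm = {(-3, "R"): 2.0, (-2, "R"): 1.5, (1, "L"): 0.5, (1, "P"): -1.0}

score_match = pssm_score("RRASLGA", toy_pssm)
score_mismatch = pssm_score("AAASPGA", toy_pssm)
```

Ranking all candidate sites of a proteome by such a score and applying a cutoff yields putative substrates, which, as noted in the text, remain hypotheses until they are tested experimentally.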
The described methods provide predictions that are very valu-
able but not free from error, for example due to the described
differences in in vitro and in vivo kinase specificity or the influence
of subcellular localization. Thus, the predicted kinase-substrate
interactions should be considered hypotheses to be tested
experimentally [85].
We hereafter present four computational methods to infer
kinase activities from phosphoproteomics data, which use either
curated or computationally predicted kinase-substrate interactions.
3.2 GSEA
Methodologically, inference of kinase activity from phosphoproteomics
data is related to the inference of transcription factor activity
based on gene expression data. A plethora of different methods has
been developed for the prediction of transcription factor activity,
e.g., the classical gene set enrichment analysis [86] or elaborated
machine learning methods [87].
For example, Drake et al. [88] analyzed the kinase signaling
network in castration-resistant prostate cancer with GSEA. They
predicted the kinases responsible for each phosphosite with kinase-
substrate interactions from PhosphoSitePlus, kinase recognition
motifs from PHOSIDA, and predictions from NetworKIN. Subsequently,
they computed the enrichment of each kinase’s targets with
the gene set enrichment algorithm after Subramanian et al. [86],
which corresponds to a Kolmogorov–Smirnov-like statistic. The
significance of the enrichment score is determined based on permutation
tests, so that the resolution of the p-value depends on the number of
permutations.
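A minimal sketch of a Kolmogorov–Smirnov-like running-sum statistic with a permutation test is given below. It follows the general idea of a weighted running sum over a ranked list and makes several simplifying assumptions: phosphosites are ranked by their value in descending order, random substrate sets of equal size serve as the null model, and the p-value is two-sided. All function names and the example data are hypothetical.

```python
import random


def enrichment_score(ranked_sites, values, target_set, weight=1.0):
    """Maximum deviation from zero of a weighted running sum."""
    hits = [site in target_set for site in ranked_sites]
    n_miss = len(ranked_sites) - sum(hits)
    hit_norm = sum(abs(values[s]) ** weight
                   for s, is_hit in zip(ranked_sites, hits) if is_hit)
    if hit_norm == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for site, is_hit in zip(ranked_sites, hits):
        if is_hit:
            running += abs(values[site]) ** weight / hit_norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_test(values, target_set, n_perm=1000, seed=0):
    """Return (observed score, empirical two-sided p-value)."""
    rng = random.Random(seed)
    ranked = sorted(values, key=values.get, reverse=True)
    targets = set(target_set) & set(ranked)
    observed = enrichment_score(ranked, values, targets)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked, len(targets)))
        if abs(enrichment_score(ranked, values, random_set)) >= abs(observed):
            extreme += 1
    # The smallest attainable p-value is 1 / (n_perm + 1).
    return observed, (extreme + 1) / (n_perm + 1)


# Made-up log2 fold changes; sites s1 to s3 are targets of one kinase.
fold_changes = {"s1": 3.0, "s2": 2.5, "s3": 2.0, "s4": 0.1,
                "s5": -0.2, "s6": -1.0, "s7": -1.5, "s8": -2.0}
es, p_value = permutation_test(fold_changes, {"s1", "s2", "s3"}, n_perm=200)
```

The add-one correction in the last step makes explicit why the number of permutations bounds the p-value: with 200 permutations, no p-value below 1/201 can be reported.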
Alternatively, the gene set enrichment web-tool Enrichr
[89, 90] can also be used for enrichment of kinases [91]. The
authors compiled kinase-substrate interactions from different
databases and extracted additional interactions manually from the
literature in order to generate kinase-target sets. Furthermore,
they added protein-protein interactions involving kinases from the
Human Protein Reference Database (HPRD) [92], based on the
assumption that those are highly enriched in kinase-substrate inter-
actions. Using this prior knowledge, the enrichment of the targets
of a kinase is then computed with Fisher’s exact test as described
in [89].
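For illustration, a one-sided Fisher's exact test for over-representation can be written with the hypergeometric distribution. The sketch assumes a fixed background of phosphosites and asks how likely it is to observe at least the given overlap between an input list and the target set of a kinase; the function and parameter names are hypothetical, and this is not the Enrichr implementation.

```python
from math import comb


def fisher_enrichment_p(n_overlap, n_input, n_targets, n_background):
    """One-sided p-value P(X >= n_overlap) under the hypergeometric model.

    n_background: number of sites in the background
    n_targets:    number of background sites that are targets of the kinase
    n_input:      number of sites in the input list
    n_overlap:    number of input sites that are targets of the kinase
    """
    total = comb(n_background, n_input)
    upper = min(n_input, n_targets)
    tail = sum(
        comb(n_targets, k) * comb(n_background - n_targets, n_input - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 3 of 10 background sites are targets of the kinase and
# all 3 input sites hit them.
p_all_hits = fisher_enrichment_p(n_overlap=3, n_input=3, n_targets=3, n_background=10)
p_no_hits = fisher_enrichment_p(n_overlap=0, n_input=3, n_targets=3, n_background=10)
```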
3.3 KAA
Another approach to link phosphoproteomics data with the activity
of kinases was presented by Qi et al. [16], who
termed it kinase activity analysis (KAA).
In this study, the authors collected phosphoproteomics data
from adult mouse testis in order to investigate the process of
mammalian spermatogenesis. With the software package iGPS
[23] they predicted putative kinase-substrate relationships for the
detected phosphosites. The authors hypothesized that the number
of links for a given kinase in the predicted kinase-substrate network
can serve as proxy for the activity of this kinase in the specific cell
type. This activity was then compared to the kinase activity back-
ground which was calculated by computing the number of links in
the background kinase-substrate network based on the mouse
phosphorylation atlas by Huttlin et al. [93]. Qi and colleagues
predicted high activity of PLK kinases in adult mouse testis and
could validate this prediction experimentally.
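The link-counting idea can be sketched as follows. The text above does not specify how the sample network is compared to the background, so the ratio of link fractions used here is purely an assumption of this sketch and not the statistic of Qi et al.; the function names and the edges are likewise hypothetical.

```python
from collections import Counter


def link_counts(edges):
    """Count predicted kinase-substrate links per kinase."""
    return Counter(kinase for kinase, _site in edges)


def relative_link_enrichment(sample_edges, background_edges):
    """Fraction of links per kinase in the sample relative to the background.

    Assumption: a ratio of link fractions stands in for the comparison
    against the background network.
    """
    sample = link_counts(sample_edges)
    background = link_counts(background_edges)
    n_sample = sum(sample.values())
    n_background = sum(background.values())
    ratios = {}
    for kinase, count in sample.items():
        if background.get(kinase, 0) == 0:
            continue  # no background information for this kinase
        ratios[kinase] = (count / n_sample) / (background[kinase] / n_background)
    return ratios


# Made-up predicted links (kinase, phosphosite).
sample_net = [("PLK1", "a"), ("PLK1", "b"), ("PLK1", "c"), ("CDK1", "d")]
background_net = [("PLK1", "a"), ("CDK1", "d"), ("CDK1", "e"), ("CDK1", "f")]
ratios = relative_link_enrichment(sample_net, background_net)
```

Note that only the presence of links enters this calculation; phosphosite abundances and fold changes are ignored.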
However, there are several limitations of KAA. For one, it is
mainly based on computational predictions of kinase-substrate
relationships, which are known to be susceptible to errors
[69, 85]. Additionally, in their method the activity of a kinase
depends only on the number of detected, putative targets. The
abundance of the individual phosphosites or the fold change
between conditions is not taken into account.
De Graaf et al. [94] chose a comparable approach in a study of
the phosphoproteome of Jurkat T cells after stimulation with pros-
taglandin E2. However, they did not explicitly calculate kinase
activities. Instead, they grouped phosphosites into different clusters
with distinct temporal profiles and used the NetworKIN algorithm
[82] to calculate the enrichment of putative targets of a given kinase
in a specific cluster. As a result, they associated kinases with tempo-
ral activity profiles based on the enrichment in one of the detected
clusters.
3.4 CLUE
A method designed specifically for time-course phosphoproteomics
data is the knowledge-based CLUster Evaluation approach, in
short CLUE [18]. This method is based on the assumption that
phosphosites targeted by the same kinase will show similar tempo-
ral profiles, which is utilized to guide a clustering algorithm and
infer kinases associated with these clusters. As in the study by de
Graaf et al. [94], kinases are not associated with distinct values for
activities but rather with temporal activity profiles. The notable
distinction of CLUE is that the clustering is found based on the
prior knowledge about kinase-substrate relationships, as outlined
below.
Methodologically, CLUE uses the k-means clustering algo-
rithm to group the phosphoproteomics data into clusters in which
the phosphosites show similar temporal kinetics. The performance
of k-means clustering is particularly sensitive to the parameter k,
i.e., the number of clusters. CLUE therefore tests a range of differ-
ent values for k and evaluates them based on the enrichment of
kinase-substrate relationships in the identified clusters. The method
utilizes the data from the PhosphoSitePlus database in order to
derive prior knowledge about kinase-substrate relationships. With
Fisher’s exact test the enrichment of the targets of a given kinase in
a specific cluster is tested for significance. The implemented scoring
system penalizes distribution of the targets of a single kinase
throughout several clusters, as well as the combination of unrelated
phosphosites in the same cluster.
CLUE is freely available as R package in the Comprehensive R
Archive Network CRAN under https://cran.r-project.org/web/
packages/ClueR/index.html.
A limitation of CLUE is that possible
‘noise’ in the prior knowledge, i.e., incorrect annotations, potentially
derived from cell type-specific kinase-substrate relationships,
can affect the performance of the clustering, although simulations
showed reasonable robustness. CLUE is tailored toward time-
course phosphoproteomics data and associates kinases with
temporal activity profiles. Since the method does not provide a single
activity score for each kinase, it may be only partly applicable
to experiments in which the individual responses of kinases to
different treatments or conditions are of interest.
3.5 KSEA
Casado et al. [17] presented a method for kinase activity estimation
based on kinase-substrate sets. Using kinase-substrate relationships
derived from the databases PhosphoSitePlus and Phospho.ELM, all
phosphosites that are targeted by a given kinase can be grouped
together into a substrate set (see Fig. 1 for an outline of the work-
flow). In theory, these phosphosites should show similar values,
since they are targeted by the same kinase. However, due to the
transient and therefore inherently noisy nature of phosphorylation,
Casado and colleagues proposed integrating the information from
all phosphosites in the substrate set in order to enhance the signal-
to-noise ratio by signal averaging [95].
For KSEA, log2-transformed fold change data is needed, i.e.,
the change of the abundance of a phosphosite between the initial
and treated states, initial and later time points, or between two
different cell types. Therefore, KSEA activity scores describe the
activity of a kinase in one condition relative to another.
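The required input can be prepared as in the following sketch, which assumes that phosphosite abundances for two conditions are available as dictionaries keyed by a phosphosite identifier. The function name, the identifiers, and the values are hypothetical; sites missing or non-positive in either condition are skipped, which is one possible way of handling missing values.

```python
from math import log2


def log2_fold_changes(control, treated):
    """log2(treated / control) for all sites quantified in both conditions."""
    return {
        site: log2(treated[site] / control[site])
        for site in control
        if site in treated and control[site] > 0 and treated[site] > 0
    }


# Made-up abundances; site C was not detected in the treated sample.
control = {"A": 100.0, "B": 50.0, "C": 10.0}
treated = {"A": 400.0, "B": 25.0}
fold_changes = log2_fold_changes(control, treated)
```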
[Figure 1 schematic. Panels: “Data Input” (P-sites by conditions, shown
on a log2(Fold Change) scale from -4 to 4), “Prior Knowledge” (Kinase X
targeting the sites S29, T42, and Y287 on substrate proteins), and
“Statistical Analysis” of the P-sites in the substrate set of kinase X,
yielding a “Kinase Activity Score”.]
Fig. 1 Work-flow of methods to obtain kinase activity scores such as KSEA. As prior knowledge, the targets of
a given kinase are extracted out of curated databases like PhosphoSitePlus. Together with the data of the
detected phosphosites, substrate sets are constructed for each kinase, from which an activity score can be
calculated
The authors suggested three possible metrics (mean score,
alternative mean score, and delta score) that can be extracted out
of the substrate set and serve as proxy for kinase activity: (1) The
main activity score, also used in following publications [96], is
defined as the mean of the log2 fold changes of the phosphosites
in the substrate set; (2) alternatively, only phosphosites with signifi-
cant fold changes can be considered for the calculation of the mean;
and (3) for the last approach, termed “delta count,” the occurrence
of significantly upregulated phosphosites in the substrate set is
counted, from which the number of significantly downregulated
sites is subtracted. For each method, the significance of the kinase
activity score is tested with an appropriate statistical test. In the
publication of Casado et al., all three measures were in good agree-
ment, even if spanning different numerical ranges (see Fig. 2). The
implementation of these three methods is discussed in detail in the
following section.
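As a minimal sketch of the three metrics, the following Python 3 snippet computes them for a single substrate set. The fold changes, the p-values, the default threshold, and the helper name ksea_scores are made up for illustration and are not taken from the publication of Casado et al. or from any library.

```python
from statistics import mean


def ksea_scores(fold_changes, p_values, alpha=0.05):
    """Return (mean score, alternative mean score, delta count).

    fold_changes: log2 fold changes of the phosphosites in one substrate set
    p_values:     p-values of these fold changes, in the same order
    alpha:        significance threshold (an assumed default)
    """
    significant = [fc for fc, p in zip(fold_changes, p_values) if p < alpha]
    mean_score = mean(fold_changes)
    # The alternative mean is undefined if no site is significant.
    alt_mean_score = mean(significant) if significant else float("nan")
    # Significantly upregulated sites minus significantly downregulated sites
    delta_count = (sum(fc > 0 for fc in significant)
                   - sum(fc < 0 for fc in significant))
    return mean_score, alt_mean_score, delta_count


# Hypothetical substrate set with five phosphosites
toy_fc = [1.0, 0.5, -0.25, 2.0, -1.0]
toy_p = [0.01, 0.20, 0.50, 0.03, 0.04]
scores = ksea_scores(toy_fc, toy_p)
print(scores)
```

In this toy set, three of the five sites pass the threshold, so the alternative mean is computed from fewer values than the mean score and spans a different numerical range.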
Like the other methods described in this section, KSEA
strongly depends on the prior knowledge kinase-substrate relation-
ships available in the freely accessible databases. These are far from
complete and therefore limit the analytical depth of the kinase
activity analysis. Additionally, databases are generally biased toward
well-studied kinases or pathways [22], so that the sizes of the
different substrate sets differ considerably. Casado et al. tried to
address these limitations by integrating information about kinase
recognition motifs and obtained comparable results.
A detailed protocol on how to use KSEA is provided in
Subheading 4.
3.6 IKAP Recently, Mischnik and colleagues introduced a machine-learning

method to estimate kinase activities and to predict putative kinase-
substrate relationships from phosphoproteomics data [19].
In their model for kinase activity, the effect e of a given kinase
j on a single phosphosite i is modeled with
$$e_{ji} = k_j \, p_{ji}$$
as a product of the kinase activity k and the affinity p of kinase j for
phosphosite i. The abundance P of the phosphosite i is expressed as
mean of all effects acting on it, since several kinases can regulate the
same phosphosite:
$$P_i = \frac{\sum_{j=1}^{m} e_{ji}}{\sum_{j=1}^{m} p_{ji}}$$
The information about the kinase-substrate relationships is also
derived from the PhosphoSitePlus database. Using a nonlinear
optimization routine, IKAP estimates the described parameters
while minimizing a least square cost function between predicted
and measured phosphosite abundance throughout time points or
conditions. For this optimization, the affinity parameters are esti-
mated globally, while the kinase activities are fitted separately for
each time point.
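The model and its cost function can be sketched in a few lines of Python 3. This is not the IKAP implementation (which is written in MATLAB); it only evaluates the model for given parameters and leaves the optimizer out. The function names and all numbers are hypothetical, and the sketch assumes that every phosphosite has at least one kinase with nonzero affinity.

```python
def predict_abundance(activities, affinities):
    """P_i = sum_j(k_j * p_ji) / sum_j(p_ji) for every phosphosite i.

    activities: kinase activities k_j at one time point
    affinities: affinities[j][i] is the affinity p_ji of kinase j for site i;
                0 means that no relationship is known
    """
    n_sites = len(affinities[0])
    predicted = []
    for i in range(n_sites):
        total_effect = sum(k * p_j[i] for k, p_j in zip(activities, affinities))
        total_affinity = sum(p_j[i] for p_j in affinities)
        predicted.append(total_effect / total_affinity)
    return predicted


def least_squares_cost(measured, predicted):
    """Sum of squared differences between measured and predicted abundances."""
    return sum((m - p) ** 2 for m, p in zip(measured, predicted))


# Hypothetical example: two kinases, three phosphosites, one time point
k = [2.0, -1.0]
p = [[1.0, 1.0, 0.0],   # kinase 0 targets sites 0 and 1
     [0.0, 3.0, 1.0]]   # kinase 1 targets sites 1 and 2
pred = predict_abundance(k, p)
cost = least_squares_cost([2.0, 0.0, -1.0], pred)
print(pred, cost)
```

An optimizer would vary the activities (per time point) and the affinities (shared across time points) until this cost is minimal.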
[Figure 2 plots. Panels A and B: activity score (mean and median) versus
time [min]; panel C: activity score versus time [min]; panel D:
log2(fold change) versus time [min]. All panels cover the time points
5, 10, 20, 30, and 60 min.]
Fig. 2 KSEA activity scores for Casein kinase II subunit alpha. (a) Activity scores for Casein kinase II subunit
alpha over all time points of the de Graaf dataset [94], calculated as the mean of all phosphosites in the
substrate set. In yellow, the median has been used. (b) Activity scores for Casein kinase II subunit alpha over
all time points of the de Graaf dataset, calculated as the mean of all significantly regulated phosphosites in the
substrate set. The median is again shown in yellow. (c) Delta score for Casein kinase II subunit alpha over all
time points of the de Graaf dataset, calculated as number of significantly upregulated phosphosites minus the
number of significantly downregulated phosphosites in the substrate set. (d) The log2 fold changes for all time
points for all phosphosites in the substrate set of the Casein kinase II subunit alpha
In a second step, putative new kinase-substrate relationships are
predicted based on the correlation of a phosphosite with the
estimated activity of a kinase throughout time points or conditions.
These predictions are then tested by database searches and by
comparison to kinase recognition motifs from NetworKIN.
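The correlation step can be illustrated with a short Python 3 sketch. The text does not state which correlation measure or cutoff IKAP uses, so the Pearson coefficient and the threshold of 0.9 below are assumptions, as are the profiles and site names.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical profiles over five time points
kinase_activity = [0.1, 0.4, 0.9, 1.2, 1.5]
site_profiles = {
    "siteA": [0.2, 0.8, 1.8, 2.4, 3.0],  # twice the kinase activity
    "siteB": [1.5, 1.2, 0.9, 0.4, 0.1],  # reversed profile
}
threshold = 0.9  # assumed cutoff for a putative relationship
candidates = [site for site, profile in site_profiles.items()
              if pearson(kinase_activity, profile) > threshold]
print(candidates)
```

Only the site whose profile follows the estimated kinase activity is kept as a candidate; such candidates would then be checked against databases and recognition motifs as described above.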
In contrast to KSEA, which computes the kinase activity based
on the fold changes of the phosphosites in the respective substrate
set, IKAP is built on a heuristic machine learning algorithm and
tries to fit globally the described model of kinase activity and affinity
to the phosphoproteomics data. Therefore, the output of IKAP is
not only a score for the activity of a kinase, but also a value
representing the strength of a specific kinase-substrate interaction
in the investigated cell type. On the other hand, the number of
parameters that have to be estimated is rather large, so that a fair
number of experimental conditions or time points are needed for
unique solutions. Mischnik et al. included a function to perform an
identifiability analysis of the obtained kinase activities and could
show in the case of the two investigated example datasets that the
found solutions are indeed unique on the basis of the phosphopro-
teomics measurements.
The MATLAB code for IKAP can be found online under
www.github.com/marcel-mischnik/IKAP/, accompanied by an extensive
step-by-step documentation, which we recommend as additional
reading to the interested reader.
4 Protocol for KSEA
In this section, we present a stepwise, guided protocol for the
KSEA approach to infer kinase activities from phosphoproteomics
data. This protocol (part of the Kinase Activity Toolbox under
https://github.com/saezlab/kinact) is accompanied by a freely
available script, written in the Python programming language
(Python version 2.7.x) that should enable the use of KSEA for
any phosphoproteomics dataset. We plan to expand Kinact to
other methods in the future. We are going to explain the performed
computations in detail in the following protocol to facilitate under-
standing and to enable a potential re-implementation into other
programming languages.
As an example application, we will use KSEA on the
phosphoproteomics dataset from de Graaf et al. [94], which was derived
from Jurkat T cells stimulated with prostaglandin E2 and is available
as supplemental information to the article online at
http://www.mcponline.org/content/13/9/2426/suppl/DC1.
4.1 Quick Start As a quick start for practiced Python users, we can use the utility
functions from kinact to load the example dataset. The data should
be organized as a Pandas DataFrame containing the log2-transformed
fold changes, in which the columns represent different conditions
or time points and the rows individual phosphosites. The p-values
of the fold changes are optional, but should be organized in
the same way as the data.
import kinact
data_fc, data_p_value = kinact.get_example_data()
print data_fc.head()
>>>                  5min     10min     20min     30min     60min
>>> ID
>>> A0AVK6_S71  -0.319306 -0.484960 -0.798082 -0.856103 -0.928753
>>> A0FGR8_S743 -0.856661 -0.981951 -1.500412 -1.441868 -0.861470
>>> A0FGR8_S758 -1.445386 -2.397915 -2.692994 -2.794762 -1.553398
>>> A0FGR8_S691  0.271458  0.264596  0.501685  0.461984  0.655501
>>> A0JLT2_S226 -0.080786  1.069710  0.519780  0.520883 -0.296040
The kinase-substrate relationships have to be loaded as well
with the function get_kinase_targets(). In this function call, the
‘sources’ parameter specifies the databases from which we want to
integrate the information about kinase-substrate relationships,
e.g., PhosphoSitePlus, Phospho.ELM, or Signor. The function uses an
interface to the pypath Python package, which integrates several
resources for curated signaling pathways [97] (see also Note 1).
kin_sub_interactions = kinact.get_kinase_targets(sources=['all'])
An important requirement for the following analysis is that the
row indices of the data and of the prior knowledge have the same
structure (see below for more detail). As an example, KSEA can be
performed for the condition of 5 min after stimulation in the
de Graaf dataset using:
activities, p_values = kinact.ksea.ksea_mean(data_fc['5min'],
    kin_sub_interactions, mP=data_fc.values.mean(),
    delta=data_fc.values.std())
print activities.head()
>>> AKT1 0.243170
>>> AKT2 0.325643
>>> ATM -0.127511
>>> ATR -0.141812
>>> AURKA 1.783135
>>> dtype: float64
Besides the data (data_fc[‘5min’]) and kinase-substrate
interactions (kin_sub_interactions), the variables ‘mP’ and ‘delta’ are
needed to determine the z-score of the enrichment. The z-score
forms the basis for the p-value calculation. The p-values for all
kinases are corrected for multiple testing with the Benjamini-
Hochberg procedure [98].
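For readers unfamiliar with the Benjamini-Hochberg procedure, a minimal Python 3 sketch of the adjustment is given below. It is a generic textbook version, not the code used inside kinact; the function name and the input p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical p-values of four kinases
adj = benjamini_hochberg(raw)
print(adj)
```

Each p-value is scaled by the number of tests divided by its rank, so the smallest p-value receives the strongest correction.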
In Fig. 2, the different activity scores for Casein kinase II
alpha, which de Graaf et al. had associated with increased activity
after prolonged stimulation with prostaglandin E2, are shown
together with the log2 fold change values of all phosphosites that
are known to be targeted by this kinase. For the methods that use
the mean, the median can be calculated instead as a more robust
measure. The qualitative changes of the kinase activities
(Fig. 2a–c) are quite similar regardless of the method, and would
not be apparent from looking at any specific substrate phosphosite
alone (Fig. 2d).
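Why the median is the more robust choice can be seen in a small Python 3 example with made-up fold changes, in which a single outlying phosphosite dominates the mean but not the median.

```python
from statistics import mean, median

# Hypothetical log2 fold changes of one substrate set; the last site is an outlier
substrate_fc = [0.2, 0.3, 0.1, 0.25, 4.0]
mean_score = mean(substrate_fc)      # pulled up by the outlier
median_score = median(substrate_fc)  # stays close to the bulk of the sites
print(mean_score, median_score)
```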
4.2 Loading the Data In the following, we walk the reader step by step through the
procedure for KSEA. First, we need to organize the data so that
the KSEA functions can interpret it.
In Python, the library Pandas [99] provides useful data
structures and powerful tools for data analysis. Since the provided script
depends on many utilities from this library, we would strongly
advise the reader to have a look at the Pandas documentation,
although it will not be crucial in order to understand the presented
protocol. The library, together with the NumPy [100] package, can
be loaded with:
import pandas as pd
import numpy as np
The data accompanying the article is provided as an Excel
spreadsheet and can be imported into Python using the pandas ‘read_excel’
function or first be saved as a csv file, using the ‘Save As’ function in
Excel, in order to use it as described below. For convenience, in the
referenced GitHub repository, the data is already stored as a csv file,
so that this step is not necessary. The data can be loaded with the
function ‘read_csv’, which will return a Pandas DataFrame
containing the data organized in rows and columns.
data_raw = pd.read_csv('FILEPATH', sep=',')
In the DataFrame object ‘data_raw’, the columns represent the
different experimental conditions or additional information and the
rows unique phosphosites. A good way to gain an overview of
the data stored in a DataFrame and to keep track of changes is to
use ‘print data_raw.head()’ to show the first five rows of the
DataFrame or ‘print data_raw.shape’ to show the dimensions of
the DataFrame.
Phosphosites that can be matched to different proteins or
several positions within the same protein are excluded from the
analysis. In this example, ambiguous matching is indicated by the
presence of a semicolon that separates multiple possible identifiers,
and can be removed like this:
data_reduced = data_raw[~data_raw['Proteins'].str.contains(';')]
For more convenient data handling, we will index each
phosphosite with an unambiguous identifier comprising the UniProt
accession number, the type of the modified residue, and the
position within the protein. For the example of a phosphorylation of
the serine 59 in the Tyrosine-protein kinase Lck, the identifier
would be P06239_S59. The identifier can be constructed by con-
catenating the information that should be provided in the dataset.
In the example of de Graaf et al., the UniProt accession number can
be found in the column ‘Proteins’, the modified residue in ‘Amino
acid’, and the position in ‘Positions within proteins’.
The index is used to access the rows in a DataFrame and will
later be needed to construct the kinase-substrate sets. After the
creation of the identifier, the DataFrame is indexed by calling the
function ‘set_index’.
data_reduced['ID'] = (data_reduced['Proteins'] + '_' +
                      data_reduced['Amino acid'] +
                      data_reduced['Positions within proteins'])
data_indexed = data_reduced.set_index(data_reduced['ID'])
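The construction of the identifier, together with the exclusion of ambiguous matches, can be sketched without Pandas as follows (Python 3). The helper names and the second, ambiguous row are hypothetical; the first row reproduces the Lck example from the text.

```python
def make_site_id(accession, residue, position):
    """Build an identifier such as 'P06239_S59' from its three parts."""
    return "{}_{}{}".format(accession, residue, position)


def is_ambiguous(proteins_field):
    """A semicolon in the 'Proteins' field indicates several possible matches."""
    return ";" in proteins_field


# (accession, residue, position); the second row is a made-up ambiguous match
rows = [("P06239", "S", 59), ("P06239;P07947", "Y", 394)]
ids = [make_site_id(*row) for row in rows if not is_ambiguous(row[0])]
print(ids)
```

The string formatting accepts a numeric position directly. When concatenating Pandas columns with ‘+’, a position column that was read as integers would first have to be converted to strings, e.g., with ‘astype(str)’.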
Mass spectrometry data is usually accompanied by several
columns containing additional information about the phosphosite
(e.g., the sequence window) or statistics of the database search
(for example the posterior error probability), which are not neces-
sarily needed for KSEA. We therefore extract only the columns of
interest containing the processed data. In the example dataset, the
names of the crucial columns start with ‘Average’, enabling selec-
tion by a simple ‘if’ statement. Generally, more complex selection
of column names can be achieved by regular expressions with the
python module ‘re’.
data_intensity = data_indexed[[x for x in data_indexed
                               if x.startswith('Average')]] # (see Note 2)
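The remark on regular expressions can be illustrated with a short Python 3 sketch. The column names are modeled on the example dataset but are listed here by hand, and the pattern is only one possible choice.

```python
import re

# Hypothetical column names, modeled on the example dataset
columns = [
    "Proteins",
    "Average Log2 Intensity 0min",
    "Average Log2 Intensity 5min",
    "p value_5vs0min",
]
# Keep columns that start with 'Average' and end in a time point such as '5min'
pattern = re.compile(r"^Average .*\d+min$")
intensity_columns = [c for c in columns if pattern.search(c)]
print(intensity_columns)
```

The resulting list can be used to select the columns of a DataFrame in the same way as the list comprehension above.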
Now, we can compute the fold change compared to the
control, which is the condition of 0 min after stimulation. With
log(a/b) = log(a) - log(b), we obtain the fold changes by subtracting
the column with the control values from the rest using the ‘sub’
function of Pandas (see Note 3).
data_fc = data_intensity.sub(
    data_intensity['Average Log2 Intensity 0min'], axis=0)
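The effect of ‘sub’ with axis=0 is easiest to see on a toy table. The following Python 3 sketch uses made-up log2 intensities and shortened column names.

```python
import pandas as pd

# Hypothetical log2 intensities for two phosphosites
toy_intensity = pd.DataFrame(
    {"0min": [10.0, 8.0], "5min": [11.0, 7.5], "10min": [12.0, 8.0]},
    index=["P1_S10", "P2_T20"],
)
# axis=0 aligns the control column with the row index, so the control value
# of each phosphosite is subtracted from all of its time points
toy_fc = toy_intensity.sub(toy_intensity["0min"], axis=0)
print(toy_fc)
```

A log2 fold change of 1.0 corresponds to a doubling of the intensity relative to the control, and the control column itself becomes 0 everywhere.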
Further data cleaning (re-naming columns and removal of the
columns for the control time point) results in the final dataset:
data_fc.columns = [x.split()[-1] for x in data_fc] # Rename columns
data_fc.drop('0min', axis=1, inplace=True) # Delete control column
print data_fc.head()
>>>                  5min     10min     20min     30min     60min
>>> ID
>>> A0AVK6_S71  -0.319306 -0.484960 -0.798082 -0.856103 -0.928753
>>> A0FGR8_S743 -0.856661 -0.981951 -1.500412 -1.441868 -0.861470
>>> A0FGR8_S758 -1.445386 -2.397915 -2.692994 -2.794762 -1.553398
>>> A0FGR8_S691  0.271458  0.264596  0.501685  0.461984  0.655501
>>> A0JLT2_S226 -0.080786  1.069710  0.519780  0.520883 -0.296040
If the experiments have been performed with several replicates,
statistical analysis enables estimation of the significance of the fold
change compared to a control expressed by a p-value. The p-value
will be needed to perform KSEA using the ‘Delta count’ approach
but may be dispensable for the mean methods. The example dataset
contains a p-value (transformed as negative logarithm with base 10)
in selected columns and can be extracted using:
data_p_value = data_indexed[[x for x in data_indexed
                             if x.startswith('p value')]]
data_p_value = data_p_value.astype('float') # (see Note 4)
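Because the p-values are stored as negative decadic logarithms, larger values mean higher significance. The following Python 3 sketch shows how a significance level translates into a cutoff on this scale; the level of 0.05 and the helper names are assumptions for illustration.

```python
import math

alpha = 0.05                  # assumed significance level
cut_off = -math.log10(alpha)  # about 1.3 on the -log10 scale


def is_significant(neg_log10_p):
    """True if the underlying p-value is below alpha."""
    return neg_log10_p > cut_off


def to_p_value(neg_log10_p):
    """Undo the -log10 transformation."""
    return 10 ** (-neg_log10_p)


print(cut_off, is_significant(2.0), to_p_value(2.0))
```

A stored value of 2.0 corresponds to p = 0.01 and is significant at this level, whereas a stored value of 1.0 (p = 0.1) is not.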
4.3 Loading the Kinase-Substrate Relationships Now, we load the prior knowledge about
kinase-substrate relationships. In this example, we use the information
provided in the PhosphoSitePlus database (see Note 5), which can be
downloaded from the website www.phosphosite.org. The organization of the
data from comparable databases, e.g., Phospho.ELM, does not
differ drastically from the one from PhosphoSitePlus and therefore
requires only minor modifications. Using ‘read_csv’ again, we load
the downloaded file with:
ks_rel = pd.read_csv('FILEPATH', sep='\t') # (see Note 6)
In this file, every row corresponds to an interaction between a
kinase and a unique phosphosite. However, it must first be
restricted to the organism of interest, e.g., ‘human’ or ‘mouse’,
since the interactions of different organisms are reported together
in PhosphoSitePlus.
ks_rel_human = ks_rel.loc[(ks_rel['KIN_ORGANISM'] == 'human') &
                          (ks_rel['SUB_ORGANISM'] == 'human')]
Next, we again construct unique identifiers for each
phosphosite using the information provided in the dataset. The modified
residue and its position are already combined in the provided data.
ks_rel_human['psite'] = (ks_rel_human['SUB_ACC_ID'] +
                         '_' + ks_rel_human['SUB_MOD_RSD'])
Now, we construct an adjacency matrix for the phosphosites
and the kinases. In this matrix, an interaction between a kinase and
a phosphosite is denoted with a 1, all other fields are filled with a 0.
For this, the Pandas function ‘pivot_table’ can be used:
ks_rel_human['value'] = 1 # (see Note 7)
adj_matrix = pd.pivot_table(ks_rel_human, values='value',
                            index='psite', columns='GENE', fill_value=0)
The result is an adjacency matrix of the form m × n with
m being the number of phosphosites and n the number of kinases.
If a kinase is known to phosphorylate a given phosphosite, the
corresponding entry in this matrix will be a 1, otherwise a 0. A
0 does not mean that there cannot be an interaction between the
kinase and the respective phosphosite, but rather that this specific
interaction has not been reported in the literature. As sanity check,
we can print the number of known kinase-substrate interactions for
each kinase saved in the adjacency matrix:
print adj_matrix.sum(axis=0).sort_values(ascending=False).head()
>>> GENE
>>> CDK2 541
>>> CDK1 458
>>> PRKACA 440
>>> CSNK2A1 437
>>> SRC 391
>>> dtype: int64
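The construction of the adjacency matrix can be reproduced on a toy table. The following Python 3 sketch uses three made-up interactions with the column names from above; the phosphosite identifiers are hypothetical.

```python
import pandas as pd

# Hypothetical kinase-substrate table with the column names used above
toy_rel = pd.DataFrame({
    "GENE": ["CDK1", "CDK1", "SRC"],
    "psite": ["P1_S10", "P2_T20", "P2_T20"],
})
toy_rel["value"] = 1
# Rows: phosphosites, columns: kinases, missing combinations filled with 0
toy_adj = pd.pivot_table(toy_rel, values="value", index="psite",
                         columns="GENE", fill_value=0)
targets_per_kinase = toy_adj.sum(axis=0)
print(toy_adj)
print(targets_per_kinase)
```

Here the site P2_T20 is shared by both kinases, so it contributes to the column sums of CDK1 and SRC alike.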
4.4 KSEA In the accompanying toolbox, we provide for each method of
KSEA a custom Python function that automates the analysis for
all kinases in a given condition. Here, however, we demonstrate the
principle of KSEA by computing the different activity scores for a
single kinase and a single condition. As an example, the Cyclin-
dependent kinase 1 (CDK1, see Note 8) and the condition of
60 min after prostaglandin stimulation shall be used.
data_condition = data_fc['60min'].copy()
p_values = data_p_value['p value_60vs0min']
kinase = 'CDK1'
First, we determine the overlap between the known targets of
the kinase and the detected phosphosites in this condition, because
we need it for every method of KSEA. Now, we benefit from having
the same format for the index of the dataset and the adjacency
matrix. We can use the Python function ‘intersection’ to determine
the overlap between two sets.
substrate_set = adj_matrix[kinase].replace(
    0, np.nan).dropna().index # (see Note 9)
detected_p_sites = data_condition.index
intersect = list(set(substrate_set).intersection(detected_p_sites))
print len(intersect)
>>> 114
4.4.1 KSEA Using the “Mean” Method For the “mean” method, the KSEA score is equal to the mean of
the fold changes in the substrate set, mS.
The significance of the score is tested with a z-statistic using
$$z = \frac{(\mathit{mS} - \mathit{mP}) \, \sqrt{m}}{\delta}$$
with mP as mean of the complete dataset, m being the size of the
substrate set, and δ the standard deviation of the complete dataset,
adapted from the PAGE method for gene set enrichment
[101]. The “mean” method has established itself as the preferred
method in the Cutillas lab that developed the KSEA approach.
mS = data_condition.ix[intersect].mean()
mP = data_fc.values.mean()
m = len(intersect)
delta = data_fc.values.std()
z_score = (mS - mP) * np.sqrt(m) * 1/delta
The z-score can be converted into a p-value with a function
from the SciPy [102] library:
from scipy.stats import norm
p_value_mean = norm.sf(abs(z_score))
print mS, p_value_mean
>>> -0.441268760191 9.26894825183e-07
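To make the z-statistic concrete, the following Python 3 sketch recomputes it on a made-up dataset using only the standard library. The function name and all numbers are hypothetical. The sketch uses the population standard deviation, which is what NumPy's std() returns by default, and the one-sided upper tail of the standard normal distribution, which is what norm.sf returns.

```python
import math


def ksea_z_test(substrate_fc, all_fc):
    """Return (mS, z, p) for one substrate set against the complete dataset."""
    m = len(substrate_fc)
    mS = sum(substrate_fc) / m
    mP = sum(all_fc) / len(all_fc)
    # Population standard deviation of the complete dataset
    delta = math.sqrt(sum((x - mP) ** 2 for x in all_fc) / len(all_fc))
    z = (mS - mP) * math.sqrt(m) / delta
    # Upper tail of the standard normal distribution at |z|
    p = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return mS, z, p


# Hypothetical fold changes of the complete dataset and of one substrate set
all_fc = [-2.0, -1.0, 0.0, 1.0, 2.0, -2.0, -1.0, 0.0, 1.0, 2.0]
substrate_fc = [1.0, 2.0, 1.0, 2.0]
mS, z, p = ksea_z_test(substrate_fc, all_fc)
print(mS, z, p)
```

Because of the factor sqrt(m), the same shift of the substrate-set mean yields a larger z-score for larger substrate sets.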
4.4.2 KSEA Using the Alternative ‘Mean’ Method Alternatively, only the phosphosites in the substrate set that change
significantly between conditions can be considered when computing
the mean of the fold changes in the substrate set. Therefore, we
need a cutoff, determining a significant increase or decrease,
respectively, which can be a user-supplied parameter. Here, we use a
standard level to define a significant change with a cutoff of 0.05.
The significance of the KSEA score is tested as before with the
z-statistic.
cut_off = -np.log10(0.05)
set_alt = data_condition.ix[intersect].where(
    p_values.ix[intersect] > cut_off).dropna()
mS_alt = set_alt.mean()
z_score_alt = (mS_alt - mP) * np.sqrt(len(set_alt)) * 1/delta
p_value_mean_alt = norm.sf(abs(z_score_alt))
print mS_alt, p_value_mean_alt
>>> -0.680835732551 1.26298232031e-13
4.4.3 KSEA Using the “Delta Count” Method In the “Delta count” method, we count the number of
phosphosites in the substrate set that are significantly increased in the
condition versus the control and subtract the number of
phosphosites that are significantly decreased.
cut_off = -np.log10(0.05)
score_delta = len(data_condition.ix[intersect].where(
    (data_condition.ix[intersect] > 0) &
    (p_values.ix[intersect] > cut_off)).dropna()) - \
    len(data_condition.ix[intersect].where(
    (data_condition.ix[intersect] < 0) &
    (p_values.ix[intersect] > cut_off)).dropna()) # (see Note 10)
The p-value of the score is calculated with a hypergeometric
test, since the number of significantly regulated phosphosites is a
discrete variable. To initialize the hypergeometric distribution, we
need as variables M ¼ the total number of detected phosphosites,
n ¼ the size of the substrate set, and N ¼ the total number of
phosphosites that are in an arbitrary substrate set and significantly
regulated.
from scipy.stats import hypergeom
M = len(data_condition)
n = len(intersect)
N = len(np.where(
    p_values.ix[adj_matrix.index.tolist()] > cut_off)[0])
hypergeom_dist = hypergeom(M, n, N)
p_value_delta = hypergeom_dist.pmf(len(
    p_values.ix[intersect].where(
        p_values.ix[intersect] > cut_off).dropna()))
print score_delta, p_value_delta
>>> -58 8.42823410966e-119
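The hypergeometric probability used above can be written out explicitly. The following Python 3 sketch (requiring Python 3.8 or later for math.comb) follows the parametrization of scipy.stats.hypergeom(M, n, N); the numbers are made up. The upper-tail function is an addition of this sketch and not part of the procedure described in this protocol; it is shown only because enrichment tests are often based on the tail rather than on a single probability.

```python
from math import comb


def hypergeom_pmf(k, M, n, N):
    """P(X = k): k of the N regulated sites fall into a substrate set of size n,
    given M detected sites in total."""
    return comb(n, k) * comb(M - n, N - k) / comb(M, N)


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k), an alternative that sums over all outcomes at least as large."""
    return sum(hypergeom_pmf(i, M, n, N) for i in range(k, min(n, N) + 1))


# Hypothetical numbers: 20 detected sites, 5 in the substrate set,
# 6 significantly regulated, 3 of which are in the substrate set
M, n, N, k = 20, 5, 6, 3
pmf = hypergeom_pmf(k, M, n, N)
tail = hypergeom_upper_tail(k, M, n, N)
print(pmf, tail)
```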
5 Closing Remarks
In summary, the methods described in this review use different
approaches to calculate kinase activities or to relate kinases to
activity profiles from phosphoproteomics datasets. All of them
utilize prior knowledge about kinase-substrate relationships, either
from curated databases or from computational prediction tools.
Using these methods, the noisy and complex information from
the vast amount of detected phosphorylation sites can be
condensed into a much smaller set of kinase activities that is easier
to interpret. Modeling of signaling pathways or prediction of drug
responses can be performed in a straightforward way with these
kinase activities as shown in the study by Casado et al. [17].
The power of the described methods strongly depends on the
available prior knowledge about kinase-substrate relationships. As
our knowledge increases due to experimental methods like in vitro
kinase selectivity studies [103] or the CEASAR (Connecting
Enzymes And Substrates at Amino acid Resolution) approach
[104], the utility and applicability of methods for inference of
kinase activities will grow as well. Additionally, the computational
approaches for the prediction of possible kinase-substrate relation-
ships are under on-going development [84, 105], increasing the
reliability of the in silico predictions.
Phosphoproteomic data is not only valuable for the analysis of
kinase activities: for example, PTMfunc is a computational resource
that predicts the functional impact of posttranslational modifica-
tions based on structural and domain information [15], and PHO-
NEMeS [96, 106] combines phosphoproteomics data with prior
knowledge kinase-substrate relationships, in a similar fashion as
kinase-activity methods. However, instead of scoring kinases,
PHONEMeS derives logic models for signaling pathways at the
phosphosite level.
For the analysis of deregulated signaling in cancer, mutations in
key signaling molecules can be of crucial importance. Recently,
Creixell and colleagues presented a systematic classification of
genomic variants that can perturb signaling, either by rewiring of
the signaling network or by the destruction of phosphorylation
sites [107]. Another approach was introduced in the last update
of the PhosphoSitePlus database, in which the authors reported
with PTMVar [20] the addition of a dataset that can map missense
mutations onto the posttranslational modifications. With these
tools, the challenging task of creating an intersection between
genomic variations and signaling processes may be addressed.
It remains to be seen how the different scoring metrics for
kinase activity relate to each other, as they utilize different
approaches to extract a kinase activity score out of the data. IKAP
is based on a nonlinear optimization for the model of
kinase-dependent phosphorylation, KSEA on statistical analysis of the
values in the substrate set of a kinase, and CLUE on the k-means
clustering algorithm together with Fisher’s exact test for enrich-
ment. In a recent publication by Hernandez-Armenta et al. [108],
the authors compiled a benchmark dataset from the literature,
consisting of phosphoproteomic experiments under perturbation.
For each experiment, specific kinases are expected to be regulated,
e.g., EGFR receptor tyrosine kinase after stimulation with EGF.
Using this “gold standard,” the authors assessed how well different
methods for the inference of kinase activities could recapitulate the
expected kinase regulation in the different conditions. All of the
assessed methods performed comparably strongly, but the authors
observed a strong dependency on the prior knowledge about
kinase-substrate relationships. This is a first effort to assess the
applicability, performance, and drawbacks of the different methods,
thereby guiding the use of phosphoproteomics data to infer kinase
activities, from which to derive insights into molecular cancer biol-
ogy and many other processes controlled by signal transduction.
6 Notes
1. The sources parameter of the function get_kinase_targets
accepts either a list of kinase-substrate interaction sources that are
available in pypath or ‘all’ in order to include all sources.
If no source is specified, only the interactions from
PhosphoSitePlus and Signor will be used. The available sources
in pypath are “ARN” (Autophagy Regulatory Network) [109],
“CA1” (Human Hippocampal CA1 Region Neurons Signaling
Network) [110], “dbPTM” [111], “DEPOD” [75], “HPRD”
(Human Protein Reference Database) [92], “MIMP” (Muta-
tion IMpact on Phosphorylation) [112], “Macrophage” (Mac-
rophage pathways) [113], “NRF2ome” [114], “phosphoELM”
[21], “PhosphoSite” [20], “SPIKE” (Signaling Pathway
Integrated Knowledge Engine) [115], “SignaLink3” [116],
“Signor” [71], and “TRIP” (Mammalian Transient Receptor
Potential Channel-Interacting Protein Database) [117].
2. The provided code is equivalent to:
intensity_columns = []
for x in data_indexed:
    if x.startswith('Average'):
        intensity_columns.append(x)
data_intensity = data_indexed[intensity_columns]
3. In our example, it is not necessary to transform the data to log2
intensities, since the data is already provided after
log2-transformation. But for raw intensity values, the following
function from the NumPy module can be used:
data_log2 = np.log2(data_intensity)
4. Due to a compatibility problem with the output of Excel,
Python recognizes the p-values as string variables, not as floating
point numbers. Therefore, this line is needed to convert the type
of the p-values.
5. The adjacency matrix can also be constructed based on kinase
recognition motifs or kinase prediction scores and the amino
acid sequence surrounding the phosphosite. To use NetworKIN
scores for the creation of the adjacency matrix, kinact will pro-
vide dedicated functions. In the presented example, however, we
focus on the curated kinase-substrate relationships from
PhosphoSitePlus.
6. The file from PhosphoSitePlus is provided as text file in which a
tab (‘\t’) delimits the individual fields, not a comma. The file
contains a disclaimer at the top, which has to be removed first.
Alternatively, the option ‘skiprows’ in the function ‘read_csv’
can be set in order to ignore the disclaimer.
7. This column is needed so that its value is entered into the
matrix resulting from pd.pivot_table.
8. If necessary, mapping between protein names, gene names, and
UniProt accession numbers can easily be performed with the
Python module ‘bioservices’, to whose documentation we
refer the reader [118].
9. In this statement, we first select the column of the
kinase from the adjacency matrix (adj_matrix[kinase]). In
this column, we replace all 0 values with NAs (replace(0, np.nan)),
which are then deleted with dropna(). Therefore, only
those interactions remain, for which a 1 had been entered in the
matrix. Of these interactions, we extract the index, which is a list
of the phosphosites known to be targeted by the kinase of
interest.
10. The where method will return a copy of the DataFrame, in
which for cases where the condition is not true, NA is returned.
dropna will therefore delete all those occurrences, so that len
will count how often the condition is true.
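The ‘skiprows’ option mentioned in Note 6 can be tried out on a miniature, in-memory stand-in for the downloaded file. In the following Python 3 sketch, the file content and the number of disclaimer lines are made up; the real file may require a different value for ‘skiprows’.

```python
import io

import pandas as pd

# Hypothetical miniature of a tab-delimited download with two disclaimer lines
raw_text = (
    "Some disclaimer text\n"
    "More disclaimer text\n"
    "GENE\tKIN_ORGANISM\tSUB_ACC_ID\tSUB_MOD_RSD\n"
    "CDK1\thuman\tP1\tS10\n"
    "SRC\tmouse\tP2\tY20\n"
)
# skiprows=2 ignores the disclaimer, so the third line becomes the header
toy_ks = pd.read_csv(io.StringIO(raw_text), sep="\t", skiprows=2)
print(toy_ks)
```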
Acknowledgments
Thanks to Emanuel Gonçalves, Aurélien Dugourd, and Claudia
Hernández-Armenta for comments on the manuscript. For help
with the code, thanks to Emanuel Gonçalves.
References
1. Jørgensen C, Linding R (2010) Simplistic pathways or complex networks? Curr Opin Genet Dev 20:15–22
2. Hanahan D, Weinberg RA (2011) Hallmarks of cancer: the next generation. Cell 144:646–674
3. Sawyers CL (1999) Chronic myeloid leukemia. N Engl J Med 340:1330–1340
4. Sawyers CL, Hochhaus A, Feldman E et al (2002) Imatinib induces hematologic and cytogenetic responses in patients with chronic myelogenous leukemia in myeloid blast crisis: results of a phase II study. Blood 99:3530–3539
5. Zhang J, Yang PL, Gray NS (2009) Targeting cancer with small molecule kinase inhibitors. Nat Rev Cancer 9:28–39
6. Gonzalez de Castro D, Clarke PA, Al-Lazikani B et al (2012) Personalized cancer medicine: molecular diagnostics, predictive biomarkers and drug resistance. Clin Pharmacol Ther 93:252–259
7. Cutillas PR (2015) Role of phosphoproteomics in the development of personalized cancer therapies. Proteomics Clin Appl 9:383–395
8. Bertacchini J, Guida M, Accordi B et al (2014) Feedbacks and adaptive capabilities of the PI3K/Akt/mTOR axis in acute myeloid leukemia revealed by pathway selective inhibition and phosphoproteome analysis. Leukemia 28:2197–2205
9. Cutillas PR, Khwaja A, Graupera M et al (2006) Ultrasensitive and absolute quantification of the phosphoinositide 3-kinase/Akt signal transduction pathway by mass spectrometry. Proc Natl Acad Sci U S A 103:8959–8964
10. Yu Y, Anjum R, Kubota K et al (2009) A site-specific, multiplexed kinase activity assay using stable-isotope dilution and high-resolution mass spectrometry. Proc Natl Acad Sci U S A 106:11606–11611
11. McAllister FE, Niepel M, Haas W et al (2013) Mass spectrometry based method to increase throughput for kinome analyses using ATP probes. Anal Chem 85:4666–4674
12. Doll S, Burlingame AL (2015) Mass spectrometry-based detection and assignment of protein posttranslational modifications. ACS Chem Biol 10:63–71
13. Choudhary C, Mann M (2010) Decoding signalling networks by mass spectrometry-based proteomics. Nat Rev Mol Cell Biol 11:427–439
14. Sabidó E, Selevsek N, Aebersold R (2012) Mass spectrometry-based proteomics for systems biology. Curr Opin Biotechnol 23:591–597
15. Beltrao P, Albanèse V, Kenner LR et al (2012) Systematic functional prioritization of protein posttranslational modifications. Cell 150:413–425
16. Qi L, Liu Z, Wang J et al (2014) Systematic analysis of the phosphoproteome and kinase-substrate networks in the mouse testis. Mol Cell Proteomics 13:3626–3638
17. Casado P, Rodriguez-Prados J-C, Cosulich SC et al (2013) Kinase-substrate enrichment analysis provides insights into the heterogeneity of signaling pathway activation in leukemia cells. Sci Signal 6:rs6
18. Yang P, Zheng X, Jayaswal V et al (2015) Knowledge-based analysis for detecting key signaling events from time-series phosphoproteomics data. PLoS Comput Biol 11:e1004403
19. Mischnik M, Sacco F, Cox J et al (2015) IKAP: a heuristic framework for inference of kinase activities from phosphoproteomics data. Bioinformatics 32(3):424–431
20. Hornbeck PV, Zhang B, Murray B et al (2015) PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res 43:D512–D520
21. Dinkel H, Chica C, Via A et al (2011) Phospho.ELM: a database of phosphorylation sites – update 2011. Nucleic Acids Res 39:D261–D267
22. Horn H, Schoof EM, Kim J et al (2014) KinomeXplorer: an integrated platform for kinome biology studies. Nat Methods 11:603–604
23. Song C, Ye M, Liu Z et al (2012) Systematic cell lymphoma cell line. Mol Cell Proteomics
analysis of protein phosphorylation networks 4:1038–1051
from phosphoproteomic data. Mol Cell Pro- 37. Bateman NW, Goulding SP, Shulman NJ et al
teomics 11:1070–1083 (2014) Maximizing peptide identification
24. Riley NM, Coon JJ (2016) Phosphoproteo- events in proteomic workflows using data-
mics in the age of rapid and deep proteome dependent acquisition (DDA). Mol Cell Pro-
profiling. Anal Chem 88:74–94 teomics 13:329–338
25. Nilsson CL (2012) Advances in quantitative 38. Alcolea MP, Casado P, Rodrı́guez-Prados J-C
phosphoproteomics. Anal Chem 84:735–746 et al (2012) Phosphoproteomic analysis of
26. Hennrich ML, Gavin A-C (2015) Quantita- leukemia cells under basal and drug-treated
tive mass spectrometry of posttranslational conditions identifies markers of kinase path-
modifications: keys to confidence. Sci Signal way activation and mechanisms of resistance.
8:re5 Mol Cell Proteomics 11:453–466
27. Giansanti P, Aye TT, van den Toorn H et al 39. Cox J, Hein MY, Luber CA et al (2014)
(2015) An augmented multiple-protease- Accurate proteome-wide label-free quantifica-
based human phosphopeptide atlas. Cell Rep tion by delayed normalization and maximal
11:1834–1843 peptide ratio extraction, termed MaxLFQ.
28. Ruprecht B, Roesli C, Lemeer S et al (2016) Mol Cell Proteomics 13:2513–2526
MALDI-TOF and nESI Orbitrap MS/MS 40. Strittmatter EF, Ferguson PL, Tang K et al
identify orthogonal parts of the phosphopro- (2003) Proteome analyses using accurate
teome. Proteomics 16(10):1447–1456 mass and elution time peptide tags with capil-
29. Zhou H, Ye M, Dong J et al (2013) Robust lary LC time-of-flight mass spectrometry. J
phosphoproteome enrichment using mono- Am Soc Mass Spectrom 14:980–991
disperse microsphere-based immobilized tita- 41. Lange V, Picotti P, Domon B et al (2008)
nium (IV) ion affinity chromatography. Nat Selected reaction monitoring for quantitative
Protoc 8:461–480 proteomics: a tutorial. Mol Syst Biol 4:222
30. Rush J, Moritz A, Lee KA et al (2005) Immu- 42. Gillet LC, Navarro P, Tate S et al (2012)
noaffinity profiling of tyrosine phosphoryla- Targeted data extraction of the MS/MS spec-
tion in cancer cells. Nat Biotechnol tra generated by data-independent acquisi-
23:94–101 tion: a new concept for consistent and
31. Ruprecht B, Koch H, Medard G et al (2015) accurate proteome analysis. Mol Cell Proteo-
Comprehensive and reproducible phospho- mics 11:O111.016717
peptide enrichment using iron immobilized 43. Parker BL, Yang G, Humphrey SJ et al (2015)
metal ion affinity chromatography Targeted phosphoproteomics of insulin sig-
(Fe-IMAC) columns. Mol Cell Proteomics naling using data-independent acquisition
14:205–215 mass spectrometry. Sci Signal 8:rs6
32. Domon B, Aebersold R (2006) Mass spec- 44. Sidoli S, Fujiwara R, Kulej K et al (2016)
trometry and protein analysis. Science Differential quantification of isobaric phos-
(New York, NY) 312:212–217 phopeptides using data-independent acquisi-
33. Nesvizhskii AI (2007) Protein identification tion mass spectrometry. Mol BioSyst 12
by tandem mass spectrometry and sequence (8):2385–2388
database searching. Methods Mol Biol (Clif- 45. Keller A, Bader SL, Kusebauch U et al (2016)
ton, NJ) 367:87–119 Opening a SWATH window on posttransla-
34. Liu H, Sadygov RG, Yates JR (2004) A model tional modifications: automated pursuit of
for random sampling and estimation of rela- modified peptides. Mol Cell Proteomics
tive protein abundance in shotgun proteo- 15:1151–1163
mics. Anal Chem 76:4193–4201 46. Ong S-E, Blagoev B, Kratchmarova I et al
35. Cutillas PR, Vanhaesebroeck B (2007) Quan- (2002) Stable isotope labeling by amino
titative profile of five murine core proteomes acids in cell culture, SILAC, as a simple and
using label-free functional proteomics. Mol accurate approach to expression proteomics.
Cell Proteomics 6:1560–1573 Mol Cell Proteomics 1:376–386
36. Cutillas PR, Geering B, Waterfield MD et al 47. Zanivan S, Meves A, Behrendt K et al (2013)
(2005) Quantification of gel-separated pro- In vivo SILAC-based proteomics reveals
teins and their phosphorylation sites by phosphoproteome changes during mouse
LC-MS using unlabeled internal standards: skin carcinogenesis. Cell Rep 3:552–566
analysis of phosphoprotein dynamics in a B
48. Shenoy A, Geiger T (2015) Super-SILAC: 61. Baker PR, Trinidad JC, Chalkley RJ (2011)
current trends and future perspectives. Expert Modification site localization scoring
Rev Proteomics 12:13–19 integrated into a search engine. Mol Cell Pro-
49. Thompson A, Sch€afer J, Kuhn K et al (2003) teomics 10:M111.008078
Tandem mass tags: a novel quantification 62. Lemeer S, Heck AJR (2009) The phospho-
strategy for comparative analysis of complex proteomics data explosion. Curr Opin Chem
protein mixtures by MS/MS. Anal Chem Biol 13:414–420
75:1895–1904 63. Sharma K, D’Souza RCJ, Tyanova S et al
50. Ross PL, Huang YN, Marchese JN et al (2014) Ultradeep human phosphoproteome
(2004) Multiplexed protein quantitation in reveals a distinct regulatory nature of Tyr and
Saccharomyces cerevisiae using amine- Ser/Thr-based signaling. Cell Rep
reactive isobaric tagging reagents. Mol Cell 8:1583–1594
Proteomics 3:1154–1169 64. Olsen JV, Blagoev B, Gnad F et al (2006)
51. Li Z, Adams RM, Chourey K et al (2012) Global, in vivo, and site-specific phosphoryla-
Systematic comparison of label-free, meta- tion dynamics in signaling networks. Cell
bolic labeling, and isobaric chemical labeling 127:635–648
for quantitative proteomics on LTQ Orbitrap 65. Olsen JV, Vermeulen M, Santamaria A et al
Velos. J Proteome Res 11:1582–1590 (2010) Quantitative phosphoproteomics
52. Chelius D, Bondarenko PV (2002) Quantita- reveals widespread full phosphorylation site
tive profiling of proteins in complex mixtures occupancy during mitosis. Sci Signal 3:ra3
using liquid chromatography and mass spec- 66. Landry CR, Levy ED, Michnick SW (2009)
trometry. J Proteome Res 1:317–323 Weak functional constraints on phosphopro-
53. Neilson KA, Ali NA, Muralidharan S et al teomes. Trends Genet 25:193–197
(2011) Less label, more free: approaches in 67. Beltrao P, Trinidad JC, Fiedler D et al (2009)
label-free quantitative mass spectrometry. Evolution of phosphoregulation: comparison
Proteomics 11:535–553 of phosphorylation patterns across yeast spe-
54. Perkins DN, Pappin DJ, Creasy DM et al cies. PLoS Biol 7:e1000134
(1999) Probability-based protein identifica- 68. Beltrao P, Bork P, Krogan NJ et al (2013)
tion by searching sequence databases using Evolution and functional cross-talk of protein
mass spectrometry data. Electrophoresis post-translational modifications. Mol Syst
20:3551–3567 Biol 9:714
55. Clauser KR, Baker P, Burlingame AL (1999) 69. Newman RH, Zhang J, Zhu H (2014)
Role of accurate mass measurement (+/10 Toward a systems-level view of dynamic phos-
ppm) in protein identification strategies phorylation networks. Front Genet 5:263
employing MS or MS/MS and database 70. Glickman JF (2012) Assay development for
searching. Anal Chem 71:2871–2882 protein kinase enzymes. Eli Lilly & Company
56. MacCoss MJ, Wu CC, Yates JR (2002) and the National Center for Advancing Trans-
Probability-based validation of protein identi- lational Sciences, Bethesda, MD. http://
fications using a modified SEQUEST algo- www.ncbi.nlm.nih.gov/books/NBK91991/
rithm. Anal Chem 74:5593–5599 71. Perfetto L, Briganti L, Calderone A et al
57. Cox J, Neuhauser N, Michalski A et al (2011) (2016) SIGNOR: a database of causal rela-
Andromeda: a peptide search engine tionships between biological entities. Nucleic
integrated into the MaxQuant environment. Acids Res 44:D548–D554
J Proteome Res 10:1794–1805 72. Gnad F, Gunawardena J, Mann M (2011)
58. Beausoleil SA, Villén J, Gerber SA et al (2006) PHOSIDA 2011: the posttranslational modi-
A probability-based approach for high- fication database. Nucleic Acids Res 39:
throughput protein phosphorylation analysis D253–D260
and site localization. Nat Biotechnol 73. Hu J, Rho H-S, Newman RH et al (2014)
24:1285–1292 PhosphoNetworks: a database for human
59. Savitski MM, Lemeer S, Boesche M et al phosphorylation networks. Bioinformatics
(2011) Confident phosphorylation site locali- (Oxford, England) 30:141–142
zation using the Mascot Delta Score. Mol Cell 74. Sadowski I, Breitkreutz B-J, Stark C et al
Proteomics 10:M110.003830 (2013) The PhosphoGRID Saccharomyces
60. Chalkley RJ, Clauser KR (2012) Modification cerevisiae protein phosphorylation site data-
site localization scoring: strategies and perfor- base: version 2.0 update. Database 2013:
mance. Mol Cell Proteomics 11:3–14 bat026
75. Duan G, Li X, K€ ohn M (2015) The human 89. Chen EY, Tan CM, Kou Y et al (2013)
DEPhOsphorylation database DEPOD: a Enrichr: interactive and collaborative
2015 update. Nucleic Acids Res 43: HTML5 gene list enrichment analysis tool.
D531–D535 BMC Bioinformatics 14:128
76. Zhang H, Zha X, Tan Y et al (2002) Phospho- 90. Kuleshov MV, Jones MR, Rouillard AD et al
protein analysis using antibodies broadly reac- (2016) Enrichr: a comprehensive gene set
tive against phosphorylated motifs. J Biol enrichment analysis web server 2016 update.
Chem 277:39379–39387 Nucleic Acids Res 44(W1):W90–W97
77. Obenauer JC, Cantley LC, Yaffe MB (2003) 91. Lachmann A, Ma’ayan A (2009) KEA: kinase
Scansite 2.0: proteome-wide prediction of cell enrichment analysis. Bioinformatics (Oxford,
signaling interactions using short sequence England) 25:684–686
motifs. Nucleic Acids Res 31:3635–3641 92. Keshava Prasad TS, Goel R, Kandasamy K et al
78. C. Chen and B.E. Turk (2010) Analysis of (2009) Human Protein Reference Database—
serine-threonine kinase specificity using 2009 update. Nucleic Acids Res 37:
arrayed positional scanning peptide libraries., D767–D772
Curr Protoc Mol Biol Chapter 18:Unit 18.14 93. Huttlin EL, Jedrychowski MP, Elias JE et al
79. Sidhu SS, Koide S (2007) Phage display for (2010) A tissue-specific atlas of mouse protein
engineering and analyzing protein interaction phosphorylation and expression. Cell
interfaces. Curr Opin Struct Biol 17:481–487 143:1174–1189
80. Miller ML, Jensen LJ, Diella F et al (2008) 94. de Graaf EL, Giansanti P, Altelaar AFM et al
Linear motif atlas for phosphorylation- (2014) Single-step enrichment by Ti4+-
dependent signaling. Sci Signal 1:ra2 IMAC and label-free quantitation enables
81. Hjerrild M, Stensballe A, Rasmussen TE et al in-depth monitoring of phosphorylation
(2004) Identification of phosphorylation sites dynamics with high reproducibility and tem-
in protein kinase A substrates using artificial poral resolution. Mol Cell Proteomics
neural networks and mass spectrometry. J 13:2426–2434
Proteome Res 3:426–433 95. Wilm M, Mann M (1996) Analytical proper-
82. Linding R, Jensen LJ, Pasculescu A et al ties of the nanoelectrospray ion source. Anal
(2008) NetworKIN: a resource for exploring Chem 68:1–8
cellular phosphorylation networks. Nucleic 96. Wilkes EH, Terfve C, Gribben JG et al (2015)
Acids Res 36:D695–D699 Empirical inference of circuitry and plasticity
83. Szklarczyk D, Franceschini A, Wyder S et al in a kinase signaling network. Proc Natl Acad
(2015) STRING v10: protein-protein inter- Sci U S A 112:7719–7724
action networks, integrated over the tree of 97. T€urei D, Korcsmáros T, Saez-Rodriguez J
life. Nucleic Acids Res 43:D447–D452 (2016) OmniPath: guidelines and gateway
84. Wagih O, Sugiyama N, Ishihama Y et al for literature-curated signaling pathway
(2016) Uncovering phosphorylation-based resources. Nat Methods 13:966–967
specificities through functional interaction 98. Benjamini Y, Hochberg Y (2000) On the
networks. Mol Cell Proteomics 15:236–245 adaptive control of the false discovery rate in
85. Linding R, Jensen LJ, Ostheimer GJ et al multiple testing with independent statistics. J
(2007) Systematic discovery of in vivo phos- Educ Behav Stat 25:60–83
phorylation networks. Cell 129:1415–1426 99. Mckinney W (2010) Data structures for sta-
86. Subramanian A, Tamayo P, Mootha VK et al tistical computing in python. Proceedings of
(2005) Gene set enrichment analysis: a the 9th python in science conference
knowledge-based approach for interpreting 100. Van Der Walt S, Colbert SC, Varoquaux G
genome-wide expression profiles. Proc Natl (2011) The NumPy Array: A Structure for
Acad Sci U S A 102:15545–15550 Efficient Numerical Computation, Comput
87. Schacht T, Oswald M, Eils R et al (2014) Sci Eng 13:22–30. https://doi.org/10.
Estimating the activity of transcription factors 1109/MCSE.2011.37
by the effect on their target genes. Bioinfor- 101. Kim S-Y, Volsky DJ (2005) PAGE: parametric
matics (Oxford, England) 30:i401–i407 analysis of gene set enrichment. BMC Bioin-
88. Drake JM, Graham NA, Stoyanova T et al formatics 6:144
(2012) Oncogene-specific activation of tyro- 102. Jones E, Oliphant TE, Peterson P (2007)
sine kinase networks during prostate cancer Python for scientific computing. Comput Sci
progression. Proc Natl Acad Sci Eng 9:10–20
109:1643–1648 103. Imamura H, Sugiyama N, Wakabayashi M
et al (2014) Large-scale identification of
phosphorylation sites for profiling protein resource for post-translational modification

kinase selectivity. J Proteome Res of proteins. Nucleic Acids Res 44:
13:3410–3419 D435–D446
104. Newman RH, Hu J, Rho H-S et al (2013) 112. Wagih O, Reimand J, Bader GD (2015)
Construction of human activity-based phos- MIMP: predicting the impact of mutations
phorylation networks. Mol Syst Biol 9:655 on kinase-substrate phosphorylation. Nat
105. Creixell P, Palmeri A, Miller CJ et al (2015) Methods 12:531–533
Unmasking determinants of specificity in the 113. Raza S, McDerment N, Lacaze PA et al
human kinome. Cell 163:187–201 (2010) Construction of a large scale
106. Terfve CDA, Wilkes EH, Casado P et al integrated map of macrophage pathogen rec-
(2015) Large-scale models of signal propaga- ognition and effector systems. BMC Syst Biol
tion in human cells derived from discovery 4:63
phosphoproteomic data. Nat Commun 114. T€urei D, Papp D, Fazekas D et al (2013)
6:8033 NRF2-ome: an integrated web resource to
107. Creixell P, Schoof EM, Simpson CD et al discover protein interaction and regulatory
(2015) Kinome-wide decoding of network- networks of NRF2. Oxidative Med Cell
attacking mutations rewiring cancer signaling. Longev 2013:737591
Cell 163:202–217 115. Paz A, Brownstein Z, Ber Y et al (2011)
108. Hernandez-Armenta C, Ochoa D, Goncalves SPIKE: a database of highly curated human
E et al (2016) Benchmarking substrate-based signaling pathways. Nucleic Acids Res 39:
kinase activity inference using phosphopro- D793–D799
teomic data. Bioinformatics 33 116. Fazekas D, Koltai M, T€ urei D et al (2013)
(12):1845–1851 SignaLink 2 - a signaling pathway resource
109. T€urei D, Földvári-Nagy L, Fazekas D et al with multi-layered regulatory networks.
(2015) Autophagy Regulatory Network - a BMC Syst Biol 7:7
systems-level bioinformatics resource for 117. Chun JN, Lim JM, Kang Y et al (2014) A
studying the mechanism and regulation of network perspective on unraveling the role
autophagy. Autophagy 11:155–165 of TRP channels in biology and disease. Pflu-
110. Ma’ayan A, Jenkins SL, Neves S et al (2005) gers Arch 466:173–182
Formation of regulatory patterns during sig- 118. Cokelaer T, Pultz D, Harder LM et al (2013)
nal propagation in a Mammalian cellular net- BioServices: a common Python package to
work. Science (New York, NY) access biological Web Services programmati-
309:1078–1083 cally. Bioinformatics 29:3241–3242
111. Huang K-Y, Su M-G, Kao H-J et al (2016)
dbPTM 2016: 10-year anniversary of a
were made.
Chapter 7
Perseus: A Bioinformatics Platform for Integrative Analysis of Proteomics Data in Cancer Research
Stefka Tyanova and Juergen Cox
Abstract
Mass spectrometry-based proteomics is a continuously growing field marked by technological and
methodological improvements. Cancer proteomics pursues goals such as accurate diagnosis, patient
stratification, and biomarker discovery, relying on the richness of information in quantitative proteome
profiles. Translating these high-dimensional data into biological findings of clinical importance necessitates
the use of robust and powerful computational tools and methods. In this chapter, we provide a detailed
description of standard analysis steps for a clinical proteomics dataset performed in Perseus, a software
platform for functional analysis of large-scale quantitative omics data.
Key words Perseus, Software, Omics data analysis, Translational bioinformatics, Cancer proteomics
1 Introduction
High-resolution mass spectrometry-based proteomics, aided by
computational sciences, is continuously pushing the boundaries of
systems biology. Obtaining highly accurate quantitative proteomes
on a genome-wide scale is becoming feasible within realistic mea-
surement times [1]. Similar to the clinical goals of genomics and
transcriptomics to provide a deeper understanding of a certain
disease that goes beyond the standard clinical parameters of cancer
diagnosis, proteomics offers a comprehensive view of the molecular
players in a cell at a particular moment and in a specific state
[1]. The maturation of the technology together with the develop-
ment of suitable methods for quantification of human tissue pro-
teomes [2–4] has opened new doors for employing proteomics in
medical applications and is shaping the growing field of clinical
proteomics [5, 6]. Following these advances, proteomic approaches
have been used to address multiple clinical questions in the context
of various cancer types. The major area of application is the
profiling of cancer-relevant tissues, including the proteomes of
colorectal cancer [7, 8] and prostate cancer [9], as well as the
subtyping of lymphoma [10] and breast cancer [11, 12] patients.
Although proteomics has become an extremely powerful approach
for studying biomedical questions, offering unique advantages
compared to other omics techniques, the functional interpretation
of the vast amounts of data of a typical proteomics experiment often
poses analytical challenges to the biological domain experts.
https://doi.org/10.1007/978-1-4939-7493-1_7, © The Author(s) 2018
[Figure 1: workflow panels: Expression data table, Statistical analysis, Functional analysis, Data integration, Knowledge generation]
Fig. 1 Outline of a typical analysis workflow in Perseus. The workflow shows the process of converting data
into information and knowledge. Statistical analysis can be used to guide the identification of biologically
relevant hits and drive hypothesis generation. Various external databases, annotation sources, and multiple
omics types can be loaded and matched within the software and, together with powerful enrichment
techniques, allow for smooth data integration.
The aim of data analysis is to translate large amounts of pro-
teomic data that cover numerous samples, conditions and time
points into structured, domain-specific knowledge that can guide
clinical decisions (Fig. 1). Prior to any statistical analysis, data
cleansing is usually performed, which includes normalization, to
ensure that different samples are comparable, and missing value
handling, to enable the use of methods that require all data points
to be present. A plethora of imputation methods developed for
microarray data [13] can be applied to proteomics as well
[14]. Among these, methods with the underlying assumption that
missing values result from protein expression that lies under the
detection limit of modern mass spectrometers are frequently used.
A typical task of clinical proteomics studies is to identify proteins
that show differential expression between healthy and diseased
states or between different subtypes of a disease. Although commonly
established statistical methods that achieve this task exist,
distinguishing expression differences due to technical
variability, genetic heterogeneity, or even intra-sample variability
from true disease-related changes requires deep knowledge of
statistical tools and a good understanding of the underlying problems in
the analysis of omics data.
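The detection-limit assumption behind such imputation methods can be illustrated with a minimal Python sketch. It replaces missing log-scale values with random draws from a normal distribution that is narrower than, and shifted below, the distribution of the observed values. The function name and the `width` and `downshift` values are illustrative assumptions, not parameters taken from this chapter.

```python
import random
import statistics

def impute_low_abundance(values, width=0.3, downshift=1.8, seed=0):
    """Replace None entries with draws from a narrowed, down-shifted normal.

    width and downshift are expressed in units of the standard deviation
    of the observed values; both defaults are illustrative assumptions.
    """
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [
        v if v is not None else rng.gauss(mean - downshift * sd, width * sd)
        for v in values
    ]

# Hypothetical log2-transformed values for one sample; None marks a missing value.
log2_values = [24.1, 25.3, None, 26.0, 23.8, None, 25.1]
imputed = impute_low_abundance(log2_values)
```

Observed values are returned unchanged, and only the missing entries receive simulated low-abundance values.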
For instance, testing thousands of proteins for differential
expression is hampered by the multiple hypothesis-testing problem,
which results in an increased probability of calling a protein a
significant hit when there is no actual difference in expression.
This necessitates the use of correction methods to increase the
confidence of the identified hits. The choice of the appropriate
correction method depends on the balance between wrongly
accepted hits (type I errors) and wrongly rejected hits (type II
errors) that an experimentalist is willing to accept. For example,
permutation-based FDR [15] has a reduced type II error rate
compared to the Benjamini-Hochberg correction [16]. Once the
initial list of quantified proteins is narrowed down to only the
significantly changing hits, the question of their functional
relevance arises. Enrichment analysis of protein annotations is the
preferred method for deriving functional implications of sets of
proteins and is applicable to both categorical (Fisher’s exact test
[17]) and expression/numerical data (1D enrichment test [18]).
The outcome of such an analysis often offers a comprehensive
view of the biological roles of the selected proteins through high-
lighting key pathways and cellular processes in which they are
involved.
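As an illustration of the multiple-testing correction discussed above, the following Python sketch computes Benjamini-Hochberg adjusted p-values (q-values) for a list of per-protein p-values. It is a generic textbook version of the step-up procedure, not the Perseus implementation, and the example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values from eight protein-wise tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
qvals = benjamini_hochberg(pvals)
hits = [p for p, q in zip(pvals, qvals) if q <= 0.05]
```

In this toy example five raw p-values fall below 0.05, but only two remain significant at a 5% FDR after correction.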
In this chapter, we provide a step-by-step workflow of bioin-
formatic analysis of proteomics data of luminal-type breast cancer
progression. Commonly used analytical practices are described
including data cleansing and preprocessing, exploratory
analysis, statistical methods and guidelines, as well as functional
enrichment techniques. All the steps are implemented as processes
in Perseus [19], a comprehensive software for functional analysis of
omics data.
2 Materials
2.1 Software Download and Installation
Written in C#, Perseus achieves optimal performance when run on
Windows operating systems. The latest versions require a 64-bit
system and .NET Framework 4.5 to be installed on the same
computer. To use the software on macOS, set up Boot Camp and,
optionally, Parallels. Registration and acceptance of the
Software License Agreement are required prior to downloading
Perseus from the official website: http://www.coxdocs.org/doku.php?id=perseus:start.
Once the download has finished, decompress the folder, locate the
Perseus.exe file, and double-click it to start the program.
2.2 Data Files
In the subsequent analysis, we used a subset of the data measured
by Pozniak et al. [20]. The authors provide a genome-wide
proteomic analysis of breast cancer progression in patients by
studying major differences at the proteome level between healthy,
primary tumor, and metastatic tissues. The data were measured as
ratios between an optimized heavy-labeled mix of cell lines
representing different breast cancer stages and the patient proteome
[2]. This constitutes an accurate relative quantification approach
used especially in the analysis of clinical samples. Peptide and
protein identification and quantification were performed using the
MaxQuant suite for the analysis of raw mass spectrometry data
[21] at a peptide spectrum match and protein false discovery rate of
1%. The subset used in this protocol contains proteome profiles of
22 healthy, 21 lymph node negative, and 25 lymph node metastatic
tissue samples and spans over 10,000 protein groups. It can be
found in the proteinGroups.txt file provided as supplementary
material to the Pozniak et al. study (see Note 1).
3 Methods
The Methods section contains several modules covering the most
frequently performed steps in the analysis of proteomics data.
Often, a proteomics study benefits from a global overview of the
data, which usually includes the total number of identified and
quantified proteins, dynamic range, coverage of specific pathways,
and groups of proteins. A good practice in data analysis is to start
with exploratory statistics in order to check for biases in the data,
undesirable outliers, and experiments with poor quality data and to
make sure that all requirements for performing the subsequent
statistical tests are met. Once the data are filtered and normalized
appropriately, statistical and bioinformatic analyses are performed
in order to identify proteins that are likely to be functionally
important. When the list of such proteins is small enough and direct
links to the question of interest can be inferred using prior
knowledge, follow-up experiments can be performed after this step to
confirm the results of the statistical analysis. However, one of the
advantages of mass spectrometry-based proteomics is the ability to
make new discoveries in an unbiased way, for instance, through
functional analysis. This analysis is often based on enrichment tests,
which can highlight guiding biological processes and mechanisms.
3.1 Loading the Data
1. Go to the “Load” section in Perseus and click the “Generic
matrix upload” button.
2. In the pop-up window, navigate to the file to be loaded (see
Note 2).
3. Select all the expression columns and transfer them to the Main
columns window (see Note 3). Select all additional numerical
data that may be needed in the analysis and transfer them to the
Numerical columns window. Make sure that the columns
containing identifiers (e.g., protein IDs) are selected as Text
columns. Click ok.
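For readers who want to inspect the input file outside Perseus, the column assignment described above can be mimicked with a short Python sketch that reads a tab-separated table and separates expression ("main") columns from text columns. The column names, the `Ratio H/L normalized` prefix, and the two example rows are hypothetical stand-ins, not the actual content of proteinGroups.txt.

```python
import csv
import io

# Tiny stand-in for a tab-separated protein table (hypothetical content).
raw = (
    "Protein IDs\tGene names\tRatio H/L normalized A3H\tRatio H/L normalized A3T\n"
    "P04637\tTP53\t1.20\t0.45\n"
    "P38398\tBRCA1\tNaN\t2.10\n"
)

def load_matrix(handle, main_prefix="Ratio H/L normalized"):
    """Split a tab-separated table into main (numeric) and text columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    main_cols = [c for c in reader.fieldnames if c.startswith(main_prefix)]
    text_cols = [c for c in reader.fieldnames if c not in main_cols]
    rows = []
    for record in reader:
        rows.append({
            "text": {c: record[c] for c in text_cols},
            # float("NaN") yields a not-a-number value for missing entries
            "main": [float(record[c]) for c in main_cols],
        })
    return main_cols, rows

main_cols, rows = load_matrix(io.StringIO(raw))
```

Selecting columns by prefix is only one possible convention; in Perseus the assignment is made interactively in the upload dialog.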
3.2 Summary Statistics
Get familiar with the software and its five main sections: Load,
Processing, Analysis, Multi-processing, and Export (see Fig. 2).
1. In the workflow panel, change the name of the data matrix from
matrix 1 to InitialData by right-clicking the node and changing
the Alternative name box. Close the pop-up window. Explore
the right-most panel of Perseus, which contains useful
information such as number of main columns and number of rows.
2. Go to “Processing → Filter rows → Filter rows based on
categorical column” to exclude proteins identified by site,
matching to the reverse database or contaminants (see Note 4).
3. Transform the data to a logarithmic scale by going to
“Processing → Basic → Transform” and specifying the transformation
function (e.g., log2(x)).
4. In the “Processing” section, select the “Basic” menu and click
on the “Summary statistics (columns)” button. Select all
expression columns by transferring them to the right-hand
side. Click ok and explore the new matrix.
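The logic of steps 2 and 3 can be sketched in Python as follows. The sketch assumes that flagged rows carry a "+" in MaxQuant-style categorical columns; the flag column names, the row layout, and the example values are assumptions made for illustration only.

```python
import math

# Flag column names are assumptions modelled on MaxQuant-style output.
FLAG_COLUMNS = ("Only identified by site", "Reverse", "Potential contaminant")

def clean_and_log2(rows, flag_columns=FLAG_COLUMNS):
    """Drop flagged protein groups, then log2-transform the remaining ratios."""
    kept = []
    for row in rows:
        if any(row.get(flag) == "+" for flag in flag_columns):
            continue  # flagged row: excluded from further analysis
        logged = [
            math.log2(v) if v is not None and v > 0 else None
            for v in row["ratios"]
        ]
        kept.append({"id": row["id"], "ratios": logged})
    return kept

# Hypothetical protein groups; None marks a missing ratio.
rows = [
    {"id": "P1", "ratios": [2.0, 0.5, None]},
    {"id": "P2", "ratios": [4.0, 1.0, 8.0], "Reverse": "+"},
    {"id": "P3", "ratios": [1.0, 0.0, 0.25], "Potential contaminant": ""},
]
cleaned = clean_and_log2(rows)
```

Zero or missing ratios are kept as missing values after the transformation, since their logarithm is undefined.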
3.3 Filtering
1. Use the workflow window to select the InitialData matrix
by clicking on it (see Note 5).
2. In the “Processing” section, go to the “Filter rows” menu and
select “Filter rows based on valid values.” Change the Min.
valids parameter to Percentage and keep the default value of
70% for the Min. percentage of values parameter. Click ok.
Check how many protein groups were retained after the
filtering (see Note 6).
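The valid-value filter in step 2 amounts to keeping every row whose fraction of non-missing values reaches a threshold. A minimal Python sketch, with a hypothetical function name and a made-up matrix in which None marks a missing value:

```python
def filter_by_valid_values(matrix, min_fraction=0.7):
    """Keep rows whose fraction of non-missing values is at least min_fraction."""
    kept = []
    for row in matrix:
        valid = sum(value is not None for value in row)
        if valid / len(row) >= min_fraction:
            kept.append(row)
    return kept

matrix = [
    [1.2, 0.8, None, 1.1],   # 3 of 4 valid (75%): kept
    [None, None, 0.4, 0.9],  # 2 of 4 valid (50%): removed
    [0.3, 0.5, 0.7, 0.2],    # 4 of 4 valid (100%): kept
]
filtered = filter_by_valid_values(matrix)
```

Calling the same function with `min_fraction=1.0` keeps only complete rows, which corresponds to the stricter filter needed before principal component analysis.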
[Figure 2: (a) Perseus interfaces: Load, Export, Processing, Analysis, Multi-proc.; (b) annotation rows: treatment, technical replicates, numerical variable (e.g., BMI); (c) data matrix with samples as columns and protein profiles as rows; (d) protein data as Numerical, Categorical, and Text annotations]
Fig. 2 Interfaces of Perseus and the augmented data matrix format. (a) Perseus extends over five interfaces,
each of which includes various analysis and transformation functionalities and visualization possibilities. (b)
Experimental design is specified as annotation (e.g., treatment vs. control groups) or numerical rows (e.g.,
variable concentration). Multiple annotation rows can be specified that allow biological and technical
replicates to be analyzed together. (c) The data is organized in a matrix format where typically all samples
are displayed as columns and all proteins as rows. (d) Additional protein information can be added in the form
of Numerical, Categorical, or Text annotations
3.4 Exploratory Analysis
1. To visually inspect the data, go to “Analysis → Visualization →
Histograms.” Select all the samples of interest by transferring
them to the right-hand side. Click ok.
2. Explore the visualization options in the Histogram panel by
testing the functionality of each of the buttons (e.g., Properties,
Fit width, Fit height).
3. Click on the pdf button to export the plot (see Note 7).
4. Switch the view to the “Data” tab.
5. Go to “Analysis → Visualization → Multi scatter plot.” Select
the desired samples by transferring them to the right-hand side.
Click ok (see Fig. 3).
6. Adjust the plot using the Fit width and Fit height options and
resizing the plot window.
7. In the drop-down menu “Display in plots” in the plot window,
select Pearson correlation.
8. Select a scatter plot by clicking on it. The selected plot will be
shown in an enlarged view.
9. Select a number of proteins from the “Point” table on the right
of the multi scatter plot and examine their position in all
pairwise sample comparisons.
10. Switch back to the “Data” tab to continue with the analysis.
11. Go to “Processing → Basic → Column correlation.” Make sure
that the Type is set to Pearson correlation. The output table
contains all pairwise correlations between the selected
columns.
12. To visualize the sample correlations, go to “Analysis →
Clustering/PCA → Hierarchical clustering.” Use the Change color
gradient option to set a continuous gradient similar to the one in
Fig. 3a.
13. Export the plot by clicking on the pdf button.
14. Navigate back to the previous data matrix by clicking on it in
the workflow panel.
15. Principal component analysis requires all values to be valid. To
remove all protein groups with missing values, repeat
Subheading 3.3, step 2, setting the percentage parameter to 100 (see
Note 8).
16. Go to “Analysis → Clustering/PCA → Principal component
analysis” and click ok. Explore the sample separation (dot plot
in the upper panel) and the corresponding loadings (dot plot in
the lower panel).
17. In the table on the right of the PCA plot, select a set of samples
(e.g., all samples that belong to one experimental condition)
and change their color by clicking on the Symbol color button
and selecting the desired color.
18. Check the contribution of other components by substituting
Component 1 and 2 with other components from the
drop-down menu. Find the components that show sample
separation according to the experimental conditions (see Fig. 3c).
[Figure 3: (a) hierarchical clustering heat map of pairwise sample correlations (color scale: correlation coefficient, 0.2 to 0.8); (b) multi scatter plot of the Healthy, Primary, and Metastasis ratio profiles; (c) PCA component plot of Healthy, Primary tumors, and Metastasis samples]
Fig. 3 Exploratory analysis outputs in Perseus. (a) Hierarchical clustering of all the samples based on the
correlation coefficients between them reveals higher similarity between primary and metastatic tumors versus
healthy tissue samples. (b) Multi-scatter plot of averaged profiles among the three main groups clearly
represents the disease progression by highlighting strong similarities between subsequent stages, e.g.,
healthy tissue samples are more similar to primary tumors than to metastasis (correlation coefficient 0.76
vs 0.69), whereas primary tumors are most similar to metastasis (R = 0.94). The category Cell division is
highlighted in bright green in all pairwise comparison plots. (c) Principal component analysis (PCA) attributes
the largest variance to the difference between healthy (blue dots) and cancer tissues (pink and red dots)
(Component 1, 21.1%) and shows that primary and metastatic tumors (pink and red dots respectively) are
difficult to distinguish
19. Explore the proteins driving this separation. In the loadings

plot beneath the PCA, change the selection Mode to rectangu-
lar selection. Hold the left mouse key down and draw a rectan-
gle around the dots in the upper right corner and then release
the mouse. The selected proteins are highlighted in the table to
the right and their labels are displayed in the plot.
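The valid-value filtering that precedes the PCA (step 15) can also be sketched outside Perseus. The following minimal Python sketch assumes an expression matrix stored as a list of rows with NaN marking missing values; the function name `filter_valid_rows` and the toy data are hypothetical and are not part of Perseus.

```python
import math

def filter_valid_rows(matrix, min_fraction):
    """Keep rows whose fraction of non-missing (non-NaN) values is >= min_fraction."""
    kept = []
    for row in matrix:
        valid = sum(1 for value in row if not math.isnan(value))
        if valid / len(row) >= min_fraction:
            kept.append(row)
    return kept

nan = math.nan
toy_matrix = [
    [1.0, 2.0, 3.0, 4.0],   # complete row
    [1.5, nan, 2.5, 3.5],   # 75% valid values
    [nan, nan, nan, 0.5],   # 25% valid values
]

strict = filter_valid_rows(toy_matrix, 1.0)  # 100% valid values, as required for PCA
mild = filter_valid_rows(toy_matrix, 0.7)    # milder filter, to be combined with imputation
```

With the strict setting only the complete row survives, which illustrates why Note 8 recommends milder filtering for analyses that tolerate missing values.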
3.5 Normalization
1. Navigate back to the data matrix before filtering for 100% valid
values (Subheading 3.3, step 2).
2. Go to “Processing → Normalization → Z-score.” Change the
Matrix access parameter to Columns and select the Use median
option. In the new data table, plot histograms for the same
subset of samples as in Subheading 3.4, step 1 (see Note 9).
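As an illustration of what a column-wise Z-score with a median center does to the data, consider the minimal Python sketch below. It assumes that the median replaces the mean as the center while the spread is still the sample standard deviation; whether Perseus uses exactly this estimator is an assumption, and the function name `zscore_columns` is hypothetical.

```python
import math
import statistics

def zscore_columns(matrix, use_median=True):
    """Center every column on its median (or mean) and scale by its standard deviation."""
    result = [row[:] for row in matrix]
    for j in range(len(matrix[0])):
        # Missing values (NaN) are ignored when estimating center and spread.
        column = [row[j] for row in matrix if not math.isnan(row[j])]
        center = statistics.median(column) if use_median else statistics.mean(column)
        spread = statistics.stdev(column)
        for row in result:
            if not math.isnan(row[j]):
                row[j] = (row[j] - center) / spread
    return result

# Two samples (columns) measured on three proteins (rows); toy values only.
toy_matrix = [[10.0, 20.0], [12.0, 26.0], [14.0, 23.0]]
z = zscore_columns(toy_matrix)
```

After this transformation every sample is centered at zero, so systematic shifts between samples no longer dominate the comparison.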
3.6 Experimental Design
1. Go to “Processing → Annot. rows → Categorical annotation
rows.” Use the Create action option to manually specify the
experimental condition to which a sample belongs (i.e., indicate
control versus stimulus, or different stages of a disease). All
the samples belonging to one condition should have the same
annotation. A new row will be added under the column names
in the newly generated data matrix (see Note 10).
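Deriving the experimental design from sample names with a regular expression (see Note 10) can be sketched as follows. The naming scheme is an assumption made for illustration only: a batch letter, a patient number, and a one-letter suffix where H stands for healthy tissue, T for primary tumor, and M for metastasis. The pattern and the function name `annotate` are hypothetical.

```python
import re

# Assumed mapping from the last character of the sample name to a condition.
SUFFIX_TO_GROUP = {"H": "Healthy", "T": "Primary tumor", "M": "Metastasis"}

# One capital letter, one or more digits, then the condition suffix.
SAMPLE_PATTERN = re.compile(r"^[A-Z]\d+([HTM])$")

def annotate(sample_names):
    """Return a dict mapping each sample name to its experimental condition."""
    groups = {}
    for name in sample_names:
        match = SAMPLE_PATTERN.match(name)
        groups[name] = SUFFIX_TO_GROUP[match.group(1)] if match else "Unassigned"
    return groups

design = annotate(["A38T", "B18M", "A19H", "Pool1"])
```

Names that do not follow the assumed scheme are labeled "Unassigned" rather than guessed, so they can be annotated manually.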
3.7 Loading Annotations
1. Go to the drop-down menu indicated with a white arrow at the
top left corner of Perseus and select “Annotation download.”
2. Click on the link in the pop-up window. Select the appropriate
annotation file (e.g., “PerseusAnnotation → FrequentlyUsed →
mainAnnot.homo_sapiens.txt.gz,” if the organism of interest is
Homo sapiens).
3. Download the file to the Perseus/conf/annotations folder.
4. Go to “Processing → Annot. columns → Add annotation.”
Select the file from the previous step as a Source.
5. Set the UniProt column parameter to the column that
contains UniProt identifiers (e.g., Protein IDs). These identifiers
will be used for overlaying the annotation data with the
expression matrix.
6. Select several categories of interest to be overlaid with the main
matrix and move them to the right-hand side. Click ok.
3.8 Differential Expression Analysis
1. Go to “Processing → Tests.” From the menu select the appropriate
test. For the data set used in this chapter, the Multiple-sample
tests option should be chosen, as more than two conditions are
compared. The default parameters do not have to be changed
(see Note 11).
2. In the Grouping parameter, specify the categorical row that
contains the experimental conditions of the samples to be used
in the differential analysis.
3. Keep the default value of 0 for the S0 parameter to use the
standard t-test statistic. Set S0 to a nonzero value to use the
modified test statistic approach described by Tusher et al. [15].
4. Select the multiple hypothesis testing correction method to
be used by specifying the Use for truncation parameter (see
Note 12, Fig. 4a).
[Figure 4. Panels: (A) permutation-based FDR scheme for proteins A to F measured in Groups 1 to 3: labels are randomized r times, p-values are computed on the observed and on the randomized data, and proteins with q-value <= FDR threshold are marked significant; (B) pairwise comparisons |m1 - m2|, |m1 - m3|, and |m2 - m3| >= THSD for the ANOVA-significant proteins A, C, D, and F.]
Fig. 4 Differential expression and multiple hypothesis testing. (a) Multiple hypothesis testing correction using a
permutation-based false discovery rate approach is shown. Labels are randomly swapped between the three
groups (blue, yellow, and red). The randomization is repeated r times. ANOVA p-values are computed both on
the measured and the permuted data and local FDR values (q-values) are computed as the fraction of
accepted hits from the permuted data over accepted hits from the measured data normalized by the total
number of randomizations r. All hits with a q-value smaller than a threshold are considered significant. (b) To
determine the exact pairwise differences of protein expression Tukey’s Honest Significant Difference (THSD)
test is used on the ANOVA significant hits. If the mean difference between two groups is greater than or equal
to the corresponding THSD, the difference is considered significant between the compared groups. q: constant
depending on the number of treatments and the degrees of freedom that can be found in a Studentized range q
table; MSE: mean squared error; n1, n2, number of data points in each group
5. Specify if a suffix should be added to the output columns

produced by Perseus. This option is relevant when multiple
tests are conducted, e.g., with different parameter settings, as it
helps to distinguish between them in the output table.
6. Inspect the output table. It contains three new columns:
ANOVA significant, Log ANOVA p-value, and ANOVA
q-value (see Note 13).
7. Go to “Processing → Filter rows → Filter rows based on
categorical column.” Set the Column parameter to ANOVA
Significant and the Mode parameter to Keep matching rows to
retain all differentially expressed proteins.
8. Go to “Processing → Tests → Post-hoc tests.” Set the Grouping
parameter to the same grouping that was used for the
ANOVA test (see Subheading 3.6, step 1) and the FDR to
the desired threshold. Tukey’s honestly significant difference
(THSD) is computed for all proteins and all pairwise comparisons,
and the significant hits within the corresponding pairs are
marked (see Note 14, Fig. 4b).
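The pairwise criterion of the post-hoc test can be sketched in a few lines of Python. The sketch assumes the Tukey-Kramer form THSD = q * sqrt(MSE/2 * (1/n1 + 1/n2)), which uses the symbols defined in the legend of Fig. 4 but is not spelled out in the text; the value of q must be looked up in a Studentized range table for the chosen significance level, and the number used below is only an illustrative placeholder. All function names and numbers are hypothetical.

```python
import math

def thsd(q, mse, n1, n2):
    """Tukey-Kramer honestly significant difference for two groups of size n1 and n2."""
    return q * math.sqrt(mse / 2.0 * (1.0 / n1 + 1.0 / n2))

def significant_pairs(group_means, group_sizes, q, mse):
    """Return the group pairs whose absolute mean difference is >= THSD."""
    names = sorted(group_means)
    hits = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            difference = abs(group_means[first] - group_means[second])
            if difference >= thsd(q, mse, group_sizes[first], group_sizes[second]):
                hits.append((first, second))
    return hits

# Toy values for a single protein; q = 3.5 is a placeholder, not a table lookup.
means = {"Healthy": 0.0, "Primary": 2.0, "Metastasis": 2.3}
sizes = {"Healthy": 10, "Primary": 10, "Metastasis": 10}
pairs = significant_pairs(means, sizes, q=3.5, mse=1.0)
```

In this toy case the protein differs between healthy tissue and both tumor groups, but not between primary tumors and metastasis.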
3.9 Clustering and Profile Plots
1. Go to “Analysis → Clustering/PCA → Hierarchical clustering.”
Keep the default parameters and click ok.
2. Inspect the resulting heatmap and the relationship between the
groups and the proteins.
3. Click on the Change color gradient button in the button ribbon
above the heatmap to examine the color scale (red means
high and green low expression) and to modify it.
4. Click on several node junctions in the protein tree that repre-
sent potentially interesting clusters of proteins (i.e., upregula-
tion in a certain experimental condition). The selected clusters
are highlighted and appear in the “Row clusters” table dis-
played to the right of the heatmap (see Note 15).
5. Inspect the different profile plots as you navigate through the
different clusters in the table. Change the color by modifying
the Color scale and export the profile plots by clicking on the
Export image button (see Fig. 5).
6. From the ribbon menu in the heat map view, click on the
Export row clustering button to add the cluster information to
a new data matrix.
3.10 Functional Analysis
1. Go to “Multi-proc. → Matching rows by name.” Both Base
and Other matrices point to the last matrix.
2. Click on Base matrix and then in the workflow window select
the data matrix that was generated before filtering for ANOVA
significant hits (Subheading 3.8, step 6).
3. In the pop-up window set Matching column in matrix 1 and
2 to a common identifier (e.g., Protein IDs).
4. In the categorical columns section, transfer the category Clus-
ter to the right hand-side. Click ok (see Note 16).
5. Go to “Processing → Annot. columns → Fisher exact test.”
Change the Column parameter to Cluster and click ok. The
resulting table contains information about all annotation cate-
gories that were found to be significantly enriched or depleted
using a Fisher’s exact test and multiple hypotheses correction
(see Note 17).
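The quantities reported in the enrichment table (see Note 17) can be illustrated with a minimal Python sketch. It assumes that the enrichment factor is the observed intersection divided by the intersection expected by chance, and it computes a one-sided (enrichment only) Fisher's exact p-value from the hypergeometric distribution; Perseus may define or compute these values differently, and the function name and numbers are hypothetical.

```python
from math import comb

def enrichment(total, category_size, selection_size, intersection):
    """Return (enrichment factor, one-sided Fisher exact p-value).

    total          -- number of proteins in the background matrix
    category_size  -- background proteins carrying the annotation term
    selection_size -- proteins in the selected cluster
    intersection   -- cluster proteins carrying the annotation term
    """
    expected = category_size * selection_size / total
    factor = intersection / expected
    # Probability of observing at least `intersection` annotated proteins by chance.
    upper = min(category_size, selection_size)
    favorable = sum(
        comb(category_size, k) * comb(total - category_size, selection_size - k)
        for k in range(intersection, upper + 1)
    )
    p_value = favorable / comb(total, selection_size)
    return factor, p_value

# Toy example: 20 background proteins, 5 annotated, cluster of 5 with 4 annotated.
factor, p_value = enrichment(total=20, category_size=5, selection_size=5, intersection=4)
```

The `total` argument makes the role of the background explicit: changing it changes both the expected intersection and the p-value, which is why the matching step in this section matters (see Note 16).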
In summary, this chapter provides a complete protocol for
fundamental analysis of proteomic data, starting from data
upload and transformation and ending with identification of
proteins, characteristic of the specific disease progression stage,
and the underlying processes in which they are involved. The
[Figure 5. Panels: (A) heatmap with hierarchical clustering of the Healthy, Primary tumor, and Metastasis groups (color scale -6 to 6) with clusters 1 to 3 marked; (B) profile plots of clusters 1 to 3 (color scale 0.1 to 0.5); (C) enriched annotation terms per cluster, listed below.]
Cluster  Annotation term                     Enrichment factor  p-value
1        Cytoplasmic translation             9.3                0.3E-02
1        Arp2/3 complex                      8.6                0.1E-02
1        Cytosolic small ribosomal subunit   8.5                1.2E-17
1        Actin nucleation                    7.8                0.3E-02
1        mRNA catabolic process              4.8                3.7E-21
2        Peroxisome                          4.1                0.8E-02
2        Large ribosomal subunit             3.6                0.5E-02
2        Oxidative phosphorylation           3.3                0.5E-02
2        Ribosome                            3.0                0.01
2        Vesicle                             1.4                0.2E-03
3        Doane breast cancer ESR1 DN         6.2                0.1E-02
3        Epidermis development               4.9                3.7E-06
3        Basolateral plasma membrane         4.3                0.2E-03
3        Cell-cell adhesion                  3.1                0.2E-02
3        ncRNA metabolic process             2.2                0.4E-02
Fig. 5 Enrichment analysis highlighting important pathways and processes. (a) Hierarchical clustering of
proteins found to have differential expression between pairs of disease states. High and low expression are
shown in red and blue respectively. Various clusters of protein groups are highlighted in the dendrogram. (b)
Profile plots of three selected clusters showing distinct behavior with respect to the three disease states are
shown: 1 strongly increased expression in tumor tissues; 2 moderate increase in tumor tissue; and
3 decreased expression in tumor samples. (c) Functional analysis of protein annotation terms resulted in
multiple categories that were enriched in the three selected clusters. The enriched terms and the
corresponding enrichment factor and p-value are shown
described analytical methods and visualization tools are

integrated in Perseus, a freely available platform for analysis of
omics data, which provides a comprehensive portfolio of analy-
sis tools with a user-friendly interface [19]. A special emphasis
is placed on employing statistically sound methods in the anal-
ysis of large data, avoiding wrong interpretation and extracting
maximum information. More advanced computational techni-
ques such as supervised learning are also supported and are
often instrumental for the analysis of complex data where
genetic and intra-tumor variability may pose challenges. Moreover,
Perseus is being continuously developed to integrate
analysis of various data types, including posttranslational
modifications and sequence information, as well as to allow deeper
functional interpretation through network and pathway
analysis.
4 Notes
1. Proteins with shared peptides that cannot be distinguished

based on the peptides identified in a bottom-up proteomics
approach are often reported together as a protein group [21].
2. The input file format of Perseus is a tab-delimited txt file that
contains a header row with the names of all columns. The type
of data is specified during file loading. Make sure that the
“Regional and Language Options” are set to English to avoid
errors while reading numerical data such as decimal separators
being wrongly interpreted.
3. Different expression and meta data can be imported in Perseus
and used for subsequent analysis. Common expression data are
in one of the following formats: normalized intensities (e.g.,
LFQ intensity as described in [4], iBAQ as described in [22])
or ratios between heavy standard and light/non-labeled sam-
ple. Other data types that can be analyzed with Perseus are
shown in Fig. 2.
4. Reverse, identified by site and contaminant proteins have to be
marked in a categorical column before these filters can be
applied. These are automatically set when MaxQuant output
tables are used for analysis in Perseus. Additional filtering
options can be used to remove proteins based on a quantitative
measure such as a minimum number of quantified peptides or a
maximum q-value.
5. Different activities produce different outputs: either a data
matrix with the same expression values plus additional
columns containing the results of the analysis, or a new data
matrix containing only the output of the analysis activity. An
activity is always performed on the data matrix, and the specific
tab of that matrix, that is currently active.
6. Depending on the nature of missing values, different filtering
strategies may be employed and are supported in Perseus. For
example, if large differences between groups are expected with
proteins having very low expression level in one of the groups,
filtering based on a minimum number of valid values in at least
one group would be a more suitable approach than filtering for
a minimum number of valid values in the complete matrix.
7. All the plots can be exported in figure-ready formats such as
pdf, tiff, or png.
8. Very stringent filtering is usually not recommended, as a large
amount of the data will be lost. Instead milder filtering com-
bined with imputation may be more appropriate.
9. Data normalization is not always necessary. Different types of
normalization can be applied on the data to correct for system-
atic shifts or skewness and to make samples comparable.
10. Regular expressions can be used to derive the experimental
design from the sample names (“Action → Create from experiment
name”). Additionally, a template txt file can be written
out, edited in an external editor program, and read in to
indicate the experimental design.
11. Analysis of differentially expressed proteins depends on the
number of compared conditions, the underlying distribution
properties, and the availability of biological replicates. For
example, data sets with one condition with replicates should
be analyzed with One-sample tests, those with two conditions
with Two-sample tests, and those with more than two conditions
with Multiple-sample tests. Paired-sample tests and tests that do
not require equal variance are also available.
12. Permutation-based FDR, the method with the largest power, is
recommended, and at least 250 repetitions are suggested.
Technical replicates have to be specified as a
separate grouping (see Subheading 3.6, step 1) and selected
with the “Preserve grouping in randomizations” option. Failure
to specify technical replicates will result in a wrong FDR
calculation. The more conservative Benjamini-Hochberg correction
can also be used when fewer type I errors are desired
at the cost of lower sensitivity.
13. The “Significant” column contains a “+” if a protein met the
selected significance threshold (usually q-value). Additionally,
p-values (probability of type I error) and the corresponding
q-values (corrected p-value) are provided in the output table.
14. Tukey’s honestly significant difference (THSD) is a post-hoc
test that when performed on ANOVA significant hits deter-
mines in exactly which pairwise group comparisons a given
protein is differentially expressed.
15. Clusters can be defined by clicking on the respective nodes in
the protein tree or based on the precise distance measure used
to compute the tree. To use the latter option, click on the
“Define row clusters” button and specify the desired number
of clusters, which will then be automatically defined.
16. The matching step is necessary in order to define the correct
background against which enrichment will be computed. A
background that is too small (only significant hits) or too large
(the complete proteome, even if not detected with MS analysis)
introduces bias in the enrichment results.
17. The enrichment output table contains information about the
values used to compute the contingency table for the Fisher’s
exact test (e.g., category and intersection size), the enrichment
factor, and the statistical significance of the hit indicated by a p-
value and the associated false discovery rate.
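The Benjamini-Hochberg correction mentioned in Note 12 [16] can be illustrated with a short Python sketch. It is a generic implementation of the step-up procedure under the assumption that all p-values come from the same family of tests; it is not the Perseus code, and the function name and toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

toy_p = [0.01, 0.04, 0.03, 0.20]
toy_q = benjamini_hochberg(toy_p)
significant = [q <= 0.05 for q in toy_q]
```

At a threshold of 0.05 only the smallest p-value of this toy example remains significant after correction.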
References
1. Mann M, Kulak NA, Nagaraj N, Cox J (2013) The coming age of complete, accurate, and ubiquitous proteomes. Mol Cell 49(4):583–590. https://doi.org/10.1016/j.molcel.2013.01.029
2. Geiger T, Cox J, Ostasiewicz P, Wisniewski JR, Mann M (2010) Super-SILAC mix for quantitative proteomics of human tumor tissue. Nat Methods 7(5):383–385. https://doi.org/10.1038/nmeth.1446
3. Shenoy A, Geiger T (2015) Super-SILAC: current trends and future perspectives. Expert Rev Proteomics 12(1):13–19. https://doi.org/10.1586/14789450.2015.982538
4. Cox J, Hein MY, Luber CA, Paron I, Nagaraj N, Mann M (2014) Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol Cell Proteomics 13(9):2513–2526. https://doi.org/10.1074/mcp.M113.031591
5. Ellis MJ, Gillette M, Carr SA, Paulovich AG, Smith RD, Rodland KK, Townsend RR, Kinsinger C, Mesri M, Rodriguez H, Liebler DC, Clinical Proteomic Tumor Analysis C (2013) Connecting genomic alterations to cancer biology with proteomics: the NCI Clinical Proteomic Tumor Analysis Consortium. Cancer Discov 3(10):1108–1112. https://doi.org/10.1158/2159-8290.CD-13-0219
6. Hanash S, Taguchi A (2010) The grand challenge to decipher the cancer proteome. Nat Rev Cancer 10(9):652–660. https://doi.org/10.1038/nrc2918
7. Wisniewski JR, Dus-Szachniewicz K, Ostasiewicz P, Ziolkowski P, Rakus D, Mann M (2015) Absolute proteome analysis of colorectal mucosa, adenoma, and cancer reveals drastic changes in fatty acid metabolism and plasma membrane transporters. J Proteome Res 14(9):4005–4018. https://doi.org/10.1021/acs.jproteome.5b00523
8. Zhang B, Wang J, Wang X, Zhu J, Liu Q, Shi Z, Chambers MC, Zimmerman LJ, Shaddox KF, Kim S, Davies SR, Wang S, Wang P, Kinsinger CR, Rivers RC, Rodriguez H, Townsend RR, Ellis MJ, Carr SA, Tabb DL, Coffey RJ, Slebos RJ, Liebler DC, Nci C (2014) Proteogenomic characterization of human colon and rectal cancer. Nature 513(7518):382–387. https://doi.org/10.1038/nature13438
9. Iglesias-Gato D, Wikstrom P, Tyanova S, Lavallee C, Thysell E, Carlsson J, Hagglof C, Cox J, Andren O, Stattin P, Egevad L, Widmark A, Bjartell A, Collins CC, Bergh A, Geiger T, Mann M, Flores-Morales A (2016) The proteome of primary prostate cancer. Eur Urol 69(5):942–952. https://doi.org/10.1016/j.eururo.2015.10.053
10. Deeb SJ, Tyanova S, Hummel M, Schmidt-Supprian M, Cox J, Mann M (2015) Machine learning based classification of diffuse large B-cell lymphoma patients by their protein expression profiles. Mol Cell Proteomics 14(11):2947–2960. https://doi.org/10.1074/mcp.M115.050245
11. Tyanova S, Albrechtsen R, Kronqvist P, Cox J, Mann M, Geiger T (2016) Proteomic maps of breast cancer subtypes. Nat Commun 7:10259. https://doi.org/10.1038/ncomms10259
12. Mertins P, Mani DR, Ruggles KV, Gillette MA, Clauser KR, Wang P, Wang X, Qiao JW, Cao S, Petralia F, Kawaler E, Mundt F, Krug K, Tu Z, Lei JT, Gatza ML, Wilkerson M, Perou CM, Yellapantula V, Huang KL, Lin C, McLellan MD, Yan P, Davies SR, Townsend RR, Skates SJ, Wang J, Zhang B, Kinsinger CR, Mesri M, Rodriguez H, Ding L, Paulovich AG, Fenyo D, Ellis MJ, Carr SA, Nci C (2016) Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 534(7605):55–62. https://doi.org/10.1038/nature18003
13. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17(6):520–525
14. Lazar C, Gatto L, Ferro M, Bruley C, Burger T (2016) Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies. J Proteome Res 15(4):1116–1125. https://doi.org/10.1021/acs.jproteome.5b00981
15. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98(9):5116–5121. https://doi.org/10.1073/pnas.091062498
16. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B 57:289–300
17. Fisher RA (1922) On the interpretation of χ2 from contingency tables, and the calculation of P. J R Stat Soc 85:87–94. https://doi.org/10.2307/2340521
18. Cox J, Mann M (2012) 1D and 2D annotation enrichment: a statistical method integrating quantitative proteomics with complementary high-throughput data. BMC Bioinformatics 13(Suppl 16):S12. https://doi.org/10.1186/1471-2105-13-S16-S12
19. Tyanova S, Temu T, Sinitcyn P, Carlson A, Hein MY, Geiger T, Mann M, Cox J (2016) The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat Methods 13(9):731–740. https://doi.org/10.1038/nmeth.3901
20. Pozniak Y, Balint-Lahat N, Rudolph JD, Lindskog C, Katzir R, Avivi C, Ponten F, Ruppin E, Barshack I, Geiger T (2016) System-wide clinical proteomics of breast cancer reveals global remodeling of tissue homeostasis. Cell Syst 2(3):172–184. https://doi.org/10.1016/j.cels.2016.02.001
21. Cox J, Mann M (2008) MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 26(12):1367–1372. https://doi.org/10.1038/nbt.1511
22. Schwanhäusser B, Busse D, Li N, Dittmar G, Schuchhardt J, Wolf J, Chen W, Selbach M (2011) Global quantification of mammalian gene expression control. Nature 473(7347):337–342. https://doi.org/10.1038/nature10098
Chapter 8
Quantitative Analysis of Tyrosine Kinase Signaling Across

Differentially Embedded Human Glioblastoma Tumors
Hannah Johnson and Forest M. White
Abstract
Glioblastoma is the most aggressive primary brain tumor with a poor mean survival even with the current
standard of care. Kinase signaling analyses of clinical glioblastoma samples provide a physiologically relevant
view of oncogenic signaling networks. Here, we describe the methods that enable the quantification of
protein expression profiles and phosphotyrosine signaling across flash frozen and optimal cutting tempera-
ture (OCT) compound embedded tumor specimens. The data derived from these experiments can be used
to identify the intra- and inter-patient heterogeneity present in these tumors. Correlation and functional
analyses on the quantitative protein expression and phosphotyrosine signaling data obtained from clinical
samples can be used to identify tyrosine kinase signaling networks present in these tumors and reveal the
differential expression of functionally related proteins. This chapter provides the quantitative mass spec-
trometry methods required for the identification of in vivo oncogenic signaling networks from human
tumor specimens.
Key words Glioblastoma, iTRAQ labeling, Heterogeneity, Phosphorylation, Mass spectrometry
1 Introduction
Glioblastoma (GBM) is the most common primary brain tumor

with the current standard of care consisting of surgical removal,
radiotherapy, and chemotherapy [1]. Despite these invasive inter-
ventions the median survival time remains at approximately
15 months following diagnosis [2]. Molecular classification of
GBM tumors has led to the identification of four sub-classes of
GBM, based largely on differences in transcriptional profiles: classi-
cal, mesenchymal, neural, and proneural. While each subtype is
associated with the mutation/dysregulation of selected molecular
drivers, the intra-tumoral heterogeneity is such that different
tumors within each sub-type may still have different oncogenic
drivers [3–6]. Intriguingly, activation of receptor tyrosine kinase
(RTK) signaling, through overexpression or mutation, is a com-
mon feature in >80% of all glioblastomas, thereby implicating
kinase signaling in the pathogenesis of glioblastoma [6]. Moreover,

most RTKs are attractive targets for therapeutic intervention, as
their activation potentiates survival through MAPK and PI3K/
AKT signaling [7]. Unfortunately, it is often difficult to determine
which RTK(s) are activated in a given tumor from genomic
profiling alone, as RTK activation is typically regulated at the
protein posttranslational modification level rather than at the tran-
scriptional level. Therefore, in order to directly identify RTK activ-
ity and thereby select potential RTK-targeted therapeutic strategies
for a given patient tumor, we have recently developed an approach
to quantifying protein tyrosine phosphorylation and protein
expression profiles in human tumor tissue specimens [8].
Using this approach, it is possible to quantify phosphorylation
events across patient samples with high sensitivity and throughput.
Profiling tyrosine phosphorylation by mass spectrometry has been
demonstrated to identify activated tyrosine kinase signaling path-
ways across a number of tumor types [9, 10]. Accurate characteri-
zation of dynamic phosphorylation signaling can prove challenging
due to the limited availability of clinical samples for proteomic
analysis. Differential embedding of human tumors before storage
often compounds the limited availability of clinical samples
[8]. Embedding tissues in formalin-fixed paraffin (FFPE) or
optimal cutting temperature (OCT) compound is routine in
pathology labs to allow sectioning for histopathologic analysis
[11]. Evaluation of the ability to quantify activated signaling
networks and protein expression profiles across these differentially
embedded tumors will allow the utilization of available tissue
samples [8, 12]. Furthermore, these analyses allow the identification of
(i) signaling changes that can occur in the tumor between resection
and freezing [13], and (ii) the presence of intra-tumoral heteroge-
neity [14–16]. We have previously quantified phosphotyrosine sig-
naling across a panel of glioblastoma patient-derived xenograft
(PDX) tumors with differing expression of the epidermal growth
factor receptor (EGFR) variant vIII [17], a panel of differentially embedded
human glioblastoma tumors [8], and across a panel of colorectal
and ovarian tumors [13]. Throughout these analyses we have
identified inter- and intra-tumoral heterogeneity at the tyrosine kinase
signaling level.
In this chapter, we describe the quantification of tyrosine kinase
signaling using iTRAQ labeling of human glioblastoma tumors.
The methodology described here can be readily applied to other
tumor types. To investigate the effect of alternate storage methods
on protein stability or protein posttranslational modifications, we
have quantified activated phosphotyrosine networks and profiled
global protein expression across pairs of human glioblastoma tumor
sections that have been either embedded in OCT compound fol-
lowed by flash freezing in liquid nitrogen (LN2) or immediately
flash frozen in LN2. Samples were labeled with iTRAQ8plex
Fig. 1 Quantification of tyrosine phosphorylation signaling and protein expression profiles across human
glioblastoma tumors. Experimental mass spectrometry workflow. Human glioblastoma tumor sections are
homogenized, reduced, alkylated, and digested with trypsin and peptides labeled with iTRAQ8plex.
Phosphotyrosine peptide enrichment is carried out by immunoprecipitation using anti-phosphotyrosine
antibodies, and the enriched peptides are analyzed by LC-MS/MS. For protein expression profiling, peptides are fractionated by
isoelectric focusing and analyzed by LC-MS/MS
followed by phosphotyrosine peptide enrichment and protein

expression profiling across the panel of human glioblastoma tumors
(Fig. 1) [8]. A quantitative proteomic analysis of these clinical
samples allows the identification of the effects of sample storage
on tyrosine kinase signaling and protein expression profiles and
enables the identification of oncogenic kinase signaling. To identify
activated phosphotyrosine signaling related to glioblastoma biol-
ogy, we describe correlation analysis and functional analysis of the
quantitative proteomic data to highlight groups of related proteins
that are co-expressed in glioblastoma tumors. Ultimately, the meth-
ods described here allow the identification of activated tyrosine
kinases and downstream signaling in vivo in the context of inter-
and intra-tumoral heterogeneity.
2 Materials
Prepare all the solutions using HPLC grade solvents unless indi-
cated otherwise. Prepare and store all the reagents at room temper-
ature unless indicated otherwise. Follow waste disposal regulations
when disposing of chemicals.
2.1 Tumor Homogenization
1. Polytron hand held homogenizer.
2. Timer.
3. Phosphate-buffered saline (PBS).
4. Protein phosphotyrosyl phosphatase inhibitor: 200 mM
sodium orthovanadate (Na3VO4) stock; make 100 μL aliquots
and store at −20 °C until ready to use.
5. Complete protease and phosphatase inhibitors.
6. Mass spectrometry lysis buffer: 8 M urea. Add urea to MilliQ
water and vortex to dissolve. Supplement with 1 mM sodium
orthovanadate, 0.1% NP-40, and protease and phosphatase
inhibitor cocktail tablets.
7. Immunoblotting lysis buffer: Radioimmunoprecipitation assay
(RIPA) buffer supplemented with 1 mM sodium orthovana-
date, 0.1% NP-40, and protease and phosphatase inhibitor
cocktail tablets.
8. Bicinchoninic acid (BCA) assay.
9. Methanol.
10. Liquid nitrogen.
11. Scales.
12. Dry ice.
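The amounts needed for the mass spectrometry lysis buffer follow from simple dilution arithmetic, sketched below in Python. The 10 mL batch volume is a hypothetical example, the molar mass of urea is taken as 60.06 g/mol, and the function names are illustrative. Because urea takes up considerable volume in solution, the buffer should be brought to the final volume after the urea has dissolved.

```python
UREA_MOLAR_MASS_G_PER_MOL = 60.06  # molar mass of urea (CH4N2O)

def stock_volume_ul(stock_mm, final_mm, final_volume_ml):
    """Volume of stock solution (in microliters) from C1 * V1 = C2 * V2."""
    return final_mm / stock_mm * final_volume_ml * 1000.0

def urea_mass_g(molarity, final_volume_ml):
    """Mass of urea (in grams) for the given molarity and final volume."""
    return molarity * (final_volume_ml / 1000.0) * UREA_MOLAR_MASS_G_PER_MOL

batch_ml = 10.0  # hypothetical batch size
vanadate_ul = stock_volume_ul(stock_mm=200.0, final_mm=1.0, final_volume_ml=batch_ml)
urea_g = urea_mass_g(molarity=8.0, final_volume_ml=batch_ml)
```

For this example batch, about 50 μL of the 200 mM sodium orthovanadate stock and about 4.8 g of urea are required.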
2.2 Immunoblotting
1. 7.5% polyacrylamide gels.
2. Nitrocellulose.
3. Blocking buffer: 5% BSA in Tris-buffered saline with Tween (TBS-T)
(150 mM NaCl, 0.1% Tween 20, 50 mM Tris–HCl, pH 8.0).
4. Primary antibodies: anti-phosphotyrosine (4G10, Millipore), anti-
EGFR (BD Biosciences), anti-Her3/ErbB3 (CST), anti-
PDGFRα (CST), anti-PDGFRβ (CST), anti-Met (CST),
anti-AKT (CST), anti-AKT pS473 (CST), anti-p53 (CST), and
anti-β-tubulin (CST).
5. Secondary antibodies: goat anti-rabbit or goat anti-mouse con-
jugated to horseradish peroxidase.
6. Enhanced chemiluminescence (ECL) detection kit.
2.3 Reduction, Alkylation, and Tryptic Digestion
1. 100 mM ammonium acetate in water, pH 8.9.
2. Reducing solution: 1 M dithiothreitol (DTT) in 100 mM
ammonium acetate pH 8.9. Store at −20 °C in aliquots.
3. Alkylation solution: 1 M iodoacetamide (IAA) in 100 mM
ammonium acetate pH 8.9.
4. Sequencing grade trypsin.
5. Formic acid.
6. C18 cartridges.
7. Acetonitrile.
2.4 iTRAQ 8plex Labeling
1. iTRAQ reagents.
2. Dissolution buffer: 500 mM triethylammonium bicarbonate
(TEAB), pH 8.5.
3. Isopropanol.
4. 0.1% acetic acid.
5. Vacuum centrifuge.
2.5 Phosphotyrosine Peptide Immunoprecipitation
1. Protein G agarose.
2. Immunoprecipitation (IP) buffer: 100 mM Tris–HCl, 1%
NP-40, pH 7.4.
3. Tris buffer: 500 mM Tris–HCl, pH 8.5.
4. Phosphotyrosine antibodies: 4G10 (Millipore), PY100 (CST),
and PT66 (Sigma).
5. Rinse buffer: 100 mM Tris–HCl, pH 7.4.
6. Elution buffer: 100 mM glycine, pH 2–2.5.
7. pH paper.
2.6 Phosphopeptide Enrichment by IMAC
1. Self-packed IMAC columns (15 cm in length): Pack columns
with Poros 20MC beads. Capillary type: OD 360 μm: ID
200 μm.
2. Easy-nLC 1000 Nano HPLC.
3. 100 mM iron(III) chloride.
4. MilliQ water.
5. 100 mM EDTA pH 8.0.
6. 0.1% acetic acid.
7. 250 mM sodium phosphate pH 8.0.
2.7 Peptide Isoelectric Focusing and Protein Expression Profiling
1. MilliQ water.
2. Formic acid.
3. Acetonitrile.
4. ZOOM isoelectric focusing (IEF) fractionator.
5. Six ZOOM disks: pH 3.0, pH 4.6, pH 5.4, pH 6.2, pH 7.0,
and pH 10.0.
6. Anode buffer: 7 M urea, 2 M thiourea, Novex IEF anode
buffer, pH 3.0.
7. Cathode buffer: 7 M urea, 2 M thiourea, Novex IEF cathode
buffer, pH 10.4.
8. ZOOM carrier ampholytes.
9. C18 cartridges.
10. Acetic acid.
2.8 Mass Spectrometric Analyses
1. Self-packed pre-columns (15 cm in length): Pack pre-columns
with 10 μm YMC gel, ODS-A, 12 nm beads. Capillary type:
OD 360 μm: ID 50 μm.
2. Self-packed analytical columns (15 cm in length): Pack analyti-
cal columns with 5 μm beads (YMC gel, ODS-AQ, 12 nm,
S-5 μm, AQ12S05). (Capillary type: OD 360 μm: ID 100 μm).
3. Easy-nLC 1000 Nano HPLC.
4. Mass Spectrometer, e.g., Orbitrap QExactive Plus mass spec-
trometer (Thermo Fisher Scientific).
5. Buffer A: 200 mM acetic acid.
6. Buffer B: 70% Acetonitrile, 200 mM acetic acid.
2.9 Protein Expression Data Analysis and Functional Data Analysis
1. Proteome Discoverer, available from Thermo Fisher Scientific.
2. MASCOT search engine software, available from Matrix
Science: http://www.matrixscience.com/
3. Human protein sequence database, downloadable from NCBI.
4. CAMV: open source software that can be downloaded from
http://white-lab.mit.edu/software/camv
5. GENE-E: open source software that can be downloaded from
http://www.broadinstitute.org/cancer/software/GENE-E/
6. PANTHER: an online gene ontology tool that can be
accessed at http://www.pantherdb.org/
7. STRING: a protein-protein interaction database that can
be accessed at http://string-db.org/
8. Phosphosite: an online database of phosphorylation sites that
can be accessed at http://www.phosphosite.org/
3 Methods
3.1 Tumor Homogenization and Removal of Optimal Cutting Temperature
It is essential that tumors are flash frozen immediately following
resection, as cold ischemia can lead to significant changes in the
tyrosine kinase signaling network [13]. Perform steps 3–8 in the
chemical hood on ice.
1. Immediately flash freeze tumor samples in liquid nitrogen
following resection or embed in OCT compound and flash
freeze in liquid nitrogen as soon as possible (ideally within
5 min).
2. Take tissue out of the tube using tweezers and deposit it in a
weighing tray that has been previously tared. Record the
weight, size, and shape of the tumor (see Note 1).
3. Rinse OCT compound embedded tumors in ice-cold PBS to
remove the OCT compound surrounding the tissue prior to
homogenization. Work on ice and minimize the time taken to
carry out this step. Place the samples on dry ice once they have
been thoroughly rinsed.
4. Homogenize tumors in ice-cold MS lysis buffer, or RIPA buffer
for immunoblotting, on ice, with 6 × 10 s pulses (full speed),
separated by 10 s intervals. Homogenize tumors weighing
approximately 50–200 mg in 3 mL of lysis buffer. Carefully
inspect the tube for any visible tissue left in the lysis buffer
at the end of this procedure. Apply additional 10 s pulses if
necessary.
5. Centrifuge tissue homogenate at 1070 × g for 5 min at 4 °C.
6. Take a 50 μL aliquot for BCA assay and place the rest of the
lysate immediately on dry ice and store at −80 °C.
7. Quantify protein concentrations using BCA.
8. Rinse the polytron homogenizer thoroughly with PBS and
methanol between samples.
3.2 Immunoblotting
The RTK status of the tumors can be used to help understand the
tyrosine phosphorylation results (e.g., to help define the relative
stoichiometry of phosphorylation between samples). RTK expression
can be assessed using standard immunoblotting (see Note 2).
1. Separate tissue homogenates on 7.5% polyacrylamide gels and
electrophoretically transfer to nitrocellulose.
2. Block nitrocellulose with blocking buffer for 1 h.
3. Dilute primary antibodies in blocking buffer and incubate with
nitrocellulose overnight at 4 °C.
4. Dilute secondary antibodies (either goat anti-rabbit or goat
anti-mouse conjugated to horseradish peroxidase) in TBS-T
at a 1:10,000 ratio and incubate at room temperature for 1 h.
5. Wash nitrocellulose 3 × 10 min with TBS-T.
6. Detect antibody binding with ECL, film, and a standard
developer.
3.3 Reduction, Alkylation, and Tryptic Digestion
1. Dilute tissue homogenates 1:10 with 100 mM ammonium
acetate pH 8.9 to reduce the urea concentration to less than
800 mM.
2. Reduce protein disulfide bridges by adding 10 mM DTT to
each tumor homogenate to be analyzed and incubate at 56 °C
for 45 min.
3. Alkylate reduced cysteines with 50 mM IAA at room temperature
in the dark for 1 h.
4. Digest proteins with sequencing grade trypsin at an
enzyme/substrate ratio of 1:100, on a rotator at room temperature
overnight.
5. Quench trypsin activity by adding formic acid to a final
concentration of 5%.
6. Remove urea from the samples by reverse phase desalting using
a C18 cartridge. Elute the peptides from the C18 cartridge into
80% acetonitrile, 0.1% formic acid.
7. Lyophilize the peptides in 400 μg aliquots and store at −80 °C.
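The reagent amounts in steps 1, 2, and 4 above follow from simple dilution arithmetic, which can be sketched as below. This is only an illustrative calculation under stated assumptions: the 8 M urea starting concentration is not given in this section and is assumed here, "10 mM DTT" is read as a final concentration reached from the 1 M stock, and all function names are hypothetical.

```python
def diluted_concentration(initial_mM, parts_sample, parts_diluent):
    """Concentration after mixing parts_sample of sample with parts_diluent of diluent."""
    return initial_mM * parts_sample / (parts_sample + parts_diluent)


def stock_volume_to_add(sample_ul, stock_mM, final_mM):
    """Volume of stock (uL) that gives final_mM once added to sample_ul.

    Solves stock_mM * v = final_mM * (sample_ul + v) for v.
    """
    if final_mM >= stock_mM:
        raise ValueError("final concentration must be below the stock concentration")
    return sample_ul * final_mM / (stock_mM - final_mM)


def trypsin_mass_ug(protein_ug, substrate_per_enzyme=100):
    """Mass of trypsin for an enzyme/substrate ratio of 1:substrate_per_enzyme."""
    return protein_ug / substrate_per_enzyme


if __name__ == "__main__":
    # ASSUMPTION: the lysis buffer contains 8 M (8000 mM) urea.
    print(diluted_concentration(8000, 1, 9))    # 1 part homogenate + 9 parts buffer
    print(diluted_concentration(8000, 1, 10))   # 1 part homogenate + 10 parts buffer
    # ASSUMPTION: 10 mM DTT is the final concentration, added from the 1 M stock.
    print(stock_volume_to_add(1000, 1000, 10))  # uL of stock per 1000 uL homogenate
    print(trypsin_mass_ug(400))                 # ug trypsin per 400 ug protein
```

Under the 8 M assumption, mixing one part homogenate with nine parts buffer gives exactly 800 mM urea, so one part plus ten parts is needed to fall below that value.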
3.4 iTRAQ 8plex Labeling
iTRAQ labeling currently allows multiplexed quantification
across up to eight different samples. Multiple iTRAQ 8plex
experiments can be combined to quantify across multiples of eight
tumors. This multiplexing strategy requires the presence of a
common sample in each experiment in order to compare across
different experiments. This multiplex labeling strategy can also be
performed with TMT reagents, available in 6-plex or 10-plex.
1. Label 400 μg peptide (quantified by BCA before C18 desalting)
from each of the tumors with one tube of iTRAQ 8plex
reagent (see Note 3).
2. Dissolve 400 μg lyophilized peptides in 30 μL dissolution
buffer. Vortex each sample for 1 min and spin at 12,000 × g
for 1 min.
3. Dissolve each tube of iTRAQ reagent in 70 μL of isopropanol.
Vortex each tube for 1 min and spin at 12,000 × g for 1 min.
4. Add the isopropanol and iTRAQ 8plex reagent to the 400 μg
peptide in dissolution buffer and vortex. Incubate at room
temperature for 2 h.
5. Concentrate the eight tubes of peptide/iTRAQ mix to 40 μL
using a vacuum centrifuge (speed-vac).
6. Combine the eight differentially labeled samples into a
single tube.
7. Sequentially rinse out all the tubes with 3 × 60 μL 0.1% acetic
acid and add to the sample.
8. Concentrate the combined iTRAQ sample using a vacuum
centrifuge (spin to dryness) and store at −80 °C. At this
point the sample is stable for long-term storage (see Note 4).
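The role of the common sample described at the start of this subheading can be illustrated with a short sketch. It assumes, hypothetically, that each experiment is exported as rows of reporter ion intensities (one row per peptide, one column per iTRAQ channel) and that one channel holds the common sample; the function name and the example numbers are illustrative only and are not part of the protocol.

```python
def normalize_to_reference(rows, ref_index):
    """Divide every channel in each row by that row's common-reference channel."""
    normalized = []
    for row in rows:
        ref = row[ref_index]
        if ref <= 0:
            raise ValueError("reference channel intensity must be positive")
        normalized.append([value / ref for value in row])
    return normalized


if __name__ == "__main__":
    # Hypothetical intensities for the same peptide in two 8plex experiments
    # (only three channels shown); channel 0 is the common sample in both.
    experiment_1 = [[100.0, 200.0, 50.0]]
    experiment_2 = [[400.0, 200.0, 800.0]]
    print(normalize_to_reference(experiment_1, 0))
    print(normalize_to_reference(experiment_2, 0))
```

Once every channel is expressed relative to the shared sample, ratios from separate experiments are on a common scale and can be compared directly.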
3.5 Phosphotyrosine Profiling
Due to the low abundance of phosphotyrosine in the cell, it is
necessary to carry out a series of steps to selectively enrich peptides
that contain a phosphotyrosine residue. All the steps should be
carried out at 4 °C unless otherwise stated.
1. Wash 60 μL protein G agarose with 300 μL IP buffer. Centrifuge
for 5 min at 2850 × g and remove the supernatant to
waste.
2. To the washed protein G agarose add 12 μg 4G10, 12 μg
PY100, and 12 μg PT66.
3. Allow the antibody to conjugate to the protein G agarose at
4 °C for 6–8 h with rotation.
4. Spin down the mixture at 2850 × g for 5 min. Remove the
supernatant and discard.
5. Wash the beads with 400 μL IP buffer for 5 min on the rotator.
Spin the beads down in the centrifuge at 2850 × g for 5 min.
6. Resuspend the iTRAQ 8plex labeled peptides in 400 μL IP
buffer, vortex until the sample is completely dissolved, and
adjust the pH to 7.4 using Tris buffer (500 mM, pH 8.5) (see
Note 5).
7. Remove the supernatant from the beads and replace it with the
sample.
8. Incubate the sample with the beads on the rotator at 4 °C
overnight.
9. The next day, centrifuge in the cold room at 2850 × g for
5 min.
10. Save the supernatant in a new tube and store it at −80 °C until
carrying out protein expression profiling (see Note 6).
11. Wash the beads with 1 × 400 μL IP buffer and then 3 × 400 μL
rinse buffer. For each wash, place the tube on the rotator for
5 min, spin down at 2850 × g for 5 min, and discard the
supernatant.
12. Add 70 μL of elution buffer and incubate at room temperature
on the rotator for 30 min.
13. Load the eluate onto an immobilized metal affinity
chromatography (IMAC) column.
3.6 Phosphopeptide Enrichment by IMAC
IMAC columns are packed and used according to the previously
described protocol [18]. The steps required to enrich for
phosphopeptides are briefly highlighted here.
1. Rinse the IMAC column with 100 mM EDTA pH 8.0 for
10 min at 10 μL/min.
2. Rinse the IMAC column with MilliQ water for 10 min at
10 μL/min.
3. Load 100 mM iron(III) chloride onto the column for 30 min
at 10 μL/min (see Note 7).
4. Rinse the IMAC column with 0.1% acetic acid for 10 min at
10 μL/min.
5. Load the iTRAQ 8plex labeled phosphotyrosine
immunoprecipitation eluate onto the IMAC column at a rate of 1–2 μL/min.
6. Rinse with 0.1% acetic acid at 10–12 μL/min for 10 min.
7. Elute peptides retained on the IMAC column onto a C18
reverse-phase pre-column with 250 mM sodium phosphate
pH 8.0.
8. Rinse the pre-column with 0.1% acetic acid to remove excess
phosphate buffer and analyze by MS.
3.7 Analysis of Tyrosine Phosphorylation by MS
1. After rinsing with 0.1% acetic acid, attach the pre-column to a
C18 reverse-phase analytical column with integrated
electrospray emitter tip.
2. Chromatographically separate peptides by reverse phase HPLC
over a 140 min gradient, with the eluent ionized by
nanoelectrospray into an Orbitrap QExactive Plus instrument.
3. Operate the instrument in positive ion mode. Record full scans
in the Orbitrap mass analyzer (resolution 60,000 FWHM) at a
mass/charge (m/z) range of 350–2000 in profile mode. Select
the top 15 most intense ions per scan for higher-energy C-trap
dissociation (HCD)-based MS/MS analysis for peptide
fragmentation and for iTRAQ reporter ion quantification,
recording MS/MS scans in the Orbitrap mass analyzer (resolution
60,000 FWHM) at a mass/charge (m/z) range of 100–2000 in
profile mode.
3.8 Peptide Isoelectric Focusing and Protein Expression Profiling
Understanding the protein expression profile within heterogeneous
tumors can provide additional biological insight to accompany
phosphorylation changes. Additionally, protein expression profiling
can often provide a molecular basis for the observed phosphorylation
changes, due to the dramatically different genetic backgrounds
of each tumor.
1. Fractionate iTRAQ labeled peptides into five fractions using
the ZOOM IEF fractionator with a set of six ZOOM disks
(pH 3.0, pH 4.6, pH 5.4, pH 6.2, pH 7.0, and pH 10.0).
2. Use anode buffers and cathode buffers as per the
manufacturer’s instructions.
3. Add MilliQ water to the iTRAQ 8plex labeled peptide sample
to a final volume of 3.35 mL.
4. Add ZOOM carrier ampholytes to each diluted sample at
1:100 and DTT to a final concentration of 20 mM.
5. Perform fractionation using the following parameters: 2 mA
current limit, 2 W power limit, with 100 V for 20 min, 200 V
for 80 min, and 600 V for 80 min.
6. Following fractionation, rinse each chamber with 500 μL
water, and add the wash to the appropriate fraction.
7. Acidify each fraction with formic acid to 0.2% and desalt using
C18 cartridges. Elute peptides into 90% acetonitrile in 0.1%
acetic acid.
8. Concentrate fractions to near dryness in a vacuum centrifuge.
9. Resuspend each fraction in 200 μL 0.1% acetic acid.
10. Dilute each resuspended fraction 1:100 with 0.1% acetic acid
and load 20 μL (approximately 600 ng protein) onto an
acidified pre-column.
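The loading amount quoted in step 10 can be checked with a short back-of-the-envelope calculation. The sketch below assumes, for illustration only, that all eight 400 μg labeled samples end up in the fractionated material, that peptides distribute evenly over the five fractions, and that no material is lost during desalting; none of these assumptions is stated in the protocol, and the function name is hypothetical.

```python
def loaded_amount_ng(total_ug=8 * 400, n_fractions=5, resuspend_ul=200,
                     dilution_factor=100, load_ul=20):
    """Estimate the peptide mass (ng) loaded onto the pre-column per fraction."""
    per_fraction_ug = total_ug / n_fractions          # even split, no losses (assumption)
    conc_ug_per_ul = per_fraction_ug / resuspend_ul   # after resuspension in 200 uL
    diluted_ug_per_ul = conc_ug_per_ul / dilution_factor
    return diluted_ug_per_ul * load_ul * 1000.0       # ug -> ng


if __name__ == "__main__":
    print(round(loaded_amount_ng()))  # compare with the ~600 ng quoted in step 10
```

Under these assumptions the estimate is 640 ng, in line with the approximate figure given in the step; real loads will be lower because of sample losses.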
3.9 Proteome Analysis by MS
1. Attach the peptide loaded pre-column to a C18 reverse-phase
analytical column with integrated electrospray emitter tip.
2. Chromatographically separate each fraction by reverse phase
HPLC over a 240 min gradient with the eluent ionized by
nanoelectrospray into an Orbitrap QExactive Plus mass
spectrometer.
3. Operate the instrument in positive ion mode. Record full scans
in the Orbitrap mass analyzer (resolution 60,000 FWHM) at a
mass/charge (m/z) range of 350–2000 in profile mode. Select
the top 15 most intense ions per scan for HCD-based MS/MS
analysis for peptide fragmentation and for iTRAQ reporter ion
quantification, recording MS/MS scans in the Orbitrap mass
analyzer (resolution 60,000 FWHM) at a mass/charge (m/z)
range of 100–2000 in profile mode.
3.10 Protein Expression Data Analysis
1. Relative quantification and protein identification can be
performed using Proteome Discoverer with MASCOT as the
search engine.
2. Search MS/MS spectra against the human protein sequence
database, downloadable from NCBI.
3. Search parameters should be set to “carbamidomethylation of
cysteines” (by IAA) as a static modification, with “oxidation of
methionine” and “iTRAQ 8plex labeling” as additional
dynamic modifications.
4. Relative quantitation of protein expression can be performed
by selecting proteins identified by at least two
peptides with a MASCOT score above 20. Only peptides unique
to a given protein should be considered for relative quantitation;
those common to other isoforms or proteins of the same
family should be excluded. Identified peptides should also be
excluded from quantitative analyses if (1) the peaks
corresponding to the iTRAQ labels are not detected, (2) the
same peptide sequence is shared by multiple proteins, or (3) the
peptide sequence is discordant.
5. A decoy database search strategy can be used to estimate the
false discovery rate (FDR), defined as the percentage of decoy
proteins identified against the total protein identifications. In
this case, the MASCOT score threshold for peptide or protein
identification can be established by setting the FDR to 1%
following a search of the spectra against the NCBI non-redundant
Homo sapiens decoy database.
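The decoy-based threshold in step 5 can be sketched as follows. This is not how MASCOT or Proteome Discoverer implement the calculation; it is a minimal illustration that assumes a hypothetical list of (score, is_decoy) pairs and takes the FDR at a given threshold as the number of accepted decoy hits divided by all accepted hits, which is one reading of the definition given above.

```python
def fdr_at_threshold(hits, threshold):
    """FDR among hits scoring at or above threshold: decoy hits / all accepted hits."""
    accepted = [is_decoy for score, is_decoy in hits if score >= threshold]
    if not accepted:
        return 0.0
    return sum(accepted) / len(accepted)


def lowest_threshold_for_fdr(hits, max_fdr=0.01):
    """Lowest observed score whose FDR does not exceed max_fdr, or None if there is none."""
    for threshold in sorted({score for score, _ in hits}):
        if fdr_at_threshold(hits, threshold) <= max_fdr:
            return threshold
    return None


if __name__ == "__main__":
    # Hypothetical search results: (MASCOT-like score, came from the decoy database?)
    hits = [(50, False), (45, False), (40, False), (30, True), (25, False), (20, True)]
    print(fdr_at_threshold(hits, 25))
    print(lowest_threshold_for_fdr(hits, max_fdr=0.01))
```

With real data the list would contain thousands of identifications, and the returned score would serve as the acceptance cutoff at 1% FDR.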
3.11 Phosphotyrosine Data Analysis
Understanding the tyrosine kinase signaling within heterogeneous
tumors is critical to define the activated networks responsible for
tumor cell proliferation, migration, and invasion.
1. Relative quantification and phosphotyrosine peptide
identification can be performed using Proteome Discoverer with
MASCOT software as the search engine.
2. Search MS/MS spectra against the human protein sequence
database, downloadable from NCBI.
3. Search parameters should be set to “carbamidomethylation of
cysteines by IAA” as a static modification, with the additional
dynamic modifications “oxidation of methionine,” “iTRAQ
8plex labeling,” and “phosphorylation of serine, threonine, and
tyrosine” (see Note 8).
4. Peptides identified by MASCOT with an ion score >25 should
be considered for further manual validation and quantification
using CAMV (see Note 9) [19].
3.12 Functional Data Analysis
1. To identify groups of similarly expressed proteins and
phosphorylation sites, perform unsupervised hierarchical clustering.
(a) Clustering of the mean normalized and log2 transformed
phosphotyrosine and protein expression quantitative
iTRAQ data (using one minus Pearson correlation as a
distance metric) can be performed using GENE-E (see
Note 10).
2. To visualize quantitative phosphotyrosine and protein
expression profiles across tumors, generate heat maps of mean
normalized and log2 transformed phosphotyrosine and protein
expression quantitative iTRAQ data (see Note 11).
(a) When using GENE-E, upload an Excel file containing the
mean normalized and log2 transformed phosphotyrosine and
protein expression quantitative iTRAQ data, with the
quantitative information specified in a data matrix in which
phosphorylation sites are the row metadata and the
corresponding iTRAQ labels are the column metadata.
(b) Heat maps can be aesthetically modified under
“preferences.”
3. To identify differences between differentially embedded
tumors (intra-tumoral heterogeneity) and between different
patients’ tumors (inter-tumoral heterogeneity), carry out
Pearson’s correlation analysis of the quantitative phosphotyrosine
or protein expression profiles using Excel. P-values can be
calculated using the t approximation (see Note 12).
(a) Pearson’s correlation (r) can be calculated in Excel using
the PEARSON function.
(b) To calculate the p-value for any value of r, calculate the
associated t-value (t) using the following formula, where n is
the number of paired observations:
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} $$
(c) Once the t-value is calculated, use the TDIST function in
Excel to find the associated p-value. This function requires
the value of t (which must be non-negative, so use its absolute
value), the degrees of freedom (i.e., n − 2), and the
number of tails.
4. Gene ontology (GO) annotations can be identified by upload-
ing gene lists to the Protein Analysis Through Evolutionary
Relationships (PANTHER) online classification system (see
Note 13).
(a) Protein names acquired from MS data output can be con-
verted to Gene names and symbols using the online phos-
phorylation site database Phosphosite.
5. The Search Tool for the Retrieval of Interacting Genes/Pro-
teins (STRING) database can be queried to identify known and
predicted protein-protein interactions within clusters of
co-expressed proteins and phosphotyrosine sites.
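For readers who prefer to prepare the clustering input outside GENE-E, the preprocessing and distance metric named in step 1 can be sketched in a few lines. This is an illustration rather than the GENE-E implementation: "mean normalized" is assumed here to mean dividing each row by its mean across channels, the function names are hypothetical, and the resulting distance matrix would still need to be passed to a hierarchical clustering routine.

```python
import math


def mean_normalize_log2(row):
    """Divide each value by the row mean, then take log2 (assumed preprocessing)."""
    mean = sum(row) / len(row)
    return [math.log2(value / mean) for value in row]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def one_minus_pearson_matrix(profiles):
    """Pairwise distance matrix using one minus Pearson correlation."""
    return [[1.0 - pearson(a, b) for b in profiles] for a in profiles]


if __name__ == "__main__":
    # Hypothetical iTRAQ intensities for three phosphorylation sites in four tumors.
    raw = [[1, 2, 4, 8], [2, 4, 8, 16], [8, 4, 2, 1]]
    profiles = [mean_normalize_log2(row) for row in raw]
    for line in one_minus_pearson_matrix(profiles):
        print([round(d, 3) for d in line])
```

Sites with the same relative profile have a distance of 0 regardless of absolute intensity, whereas perfectly anti-correlated sites have the maximum distance of 2.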
4 Notes
1. Recording the weight and morphology of each tumor before
sectioning allows the assessment of intratumoral heterogeneity
at the cellular level. Further sectioning of the tumor prior to
the analysis of proteins and phosphorylation sites enables a
complete understanding of the tumor origin (i.e., whether the
tumor section was derived from the center or the edge
of the tumor). Assessing the tumor content of each tumor
section enables accurate assessment of the protein and
phosphotyrosine level across multiple samples of the same tumor.
2. Due to the low abundance of tyrosine kinases in tissue samples
it is common to not detect these proteins in whole proteome
mass spectrometry profiling analyses. Immunoblotting pre-
sents a simple and effective method of identifying the relative
levels of tyrosine kinases present within tissues.
3. Depending on the level of tyrosine phosphorylation in the
sample and the sensitivity of the LC-MS/MS system, it may
be necessary to label 800 μg peptide with two tubes of
iTRAQ 8plex reagent.
4. Because peptides are stable in the dry state, they can be
stored long term (more than a year) at −80 °C. Repeated freeze-thaw
cycles should be avoided. If peptide samples need to be
periodically taken from the stock, make a series of aliquots from the
stock prior to drying and store at −80 °C.
5. Add low volumes of the Tris buffer pH 8.5 (i.e., 1–5 μL
increments) and measure pH using pH paper.
6. Peptide solutions are prone to degradation. To minimize this,
aliquot peptide solutions and store at −80 °C. Avoid repeated
freeze-thaw cycles, as this can lead to peptide degradation.
7. Iron(III) chloride should be kept dry due to the exothermic
reaction that takes place when iron(III) chloride undergoes
hydrolysis.
8. Phosphorylation of tyrosine needs to be a dynamic modification,
as many of the tyrosines in the sample are not phosphorylated.
9. Given the importance of site identification, manual validation is
critical to properly localize the phosphorylation site within the
peptide.
10. Hierarchical clustering can be used to identify groups of phos-
photyrosine sites and proteins that are similarly and differen-
tially expressed across the different tumor samples.
Alternatively, the quantitative data can be clustered using affin-
ity propagation clustering [17].
11. Heat maps can be generated using software packages such as
GENE-E, MATLAB (www.mathworks.com/), or R (https://
www.r-project.org/).
12. Pearson’s correlation analysis can alternatively be carried out
using software packages such as MATLAB (www.mathworks.
com/) or R (https://www.r-project.org/).
13. Functional grouping of proteins and/or phosphorylation sites
can also be assessed using DAVID functional annotation
(https://david.ncifcrf.gov) or Cytoscape (www.cytoscape.org/).
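As an alternative to the Excel functions used in Subheading 3.12, step 3 (see Note 12), the t approximation for the p-value of a Pearson correlation can be sketched in plain Python. This is an illustrative implementation under stated assumptions: the two-tailed p-value is obtained by numerically integrating Student's t density with Simpson's rule, the function names are hypothetical, and for very large t-values the number of integration steps may need to be increased.

```python
import math


def t_from_r(r, n):
    """t statistic for a Pearson correlation r computed from n paired values."""
    if n < 3 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2.0 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 - 2 * (area under the density between 0 and |t|)."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3.0
    return max(0.0, 1.0 - 2.0 * central_area)


if __name__ == "__main__":
    r, n = 0.5, 6  # hypothetical correlation from six paired measurements
    t = t_from_r(r, n)
    print(t, two_tailed_p(t, n - 2))
```

The degrees of freedom passed to the p-value function are n − 2, matching the value supplied to TDIST in Excel.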
Acknowledgments
This work was supported in part by a generous gift from the James
S. McDonnell Foundation and by NIH grants P30 CA014051 and
R01 CA184320. The authors would like to thank Ms. Marcela
White at the brain tumor bank (www.Braintumourbank.com) for
access to patient materials.
References
1. Stupp R, Mason WP, van den Bent MJ, Weller M, Fisher B, Taphoorn MJ, Belanger K, Brandes AA, Marosi C, Bogdahn U, Curschmann J, Janzer RC, Ludwin SK, Gorlia T, Allgeier A, Lacombe D, Cairncross JG, Eisenhauer E, Mirimanoff RO (2005) Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma. N Engl J Med 352(10):987–996. https://doi.org/10.1056/NEJMoa043330
2. Stupp R, Hegi ME, Mason WP, van den Bent MJ, Taphoorn MJ, Janzer RC, Ludwin SK, Allgeier A, Fisher B, Belanger K, Hau P, Brandes AA, Gijtenbeek J, Marosi C, Vecht CJ, Mokhtari K, Wesseling P, Villa S, Eisenhauer E, Gorlia T, Weller M, Lacombe D, Cairncross JG, Mirimanoff RO (2009) Effects of radiotherapy with concomitant and adjuvant temozolomide versus radiotherapy alone on survival in glioblastoma in a randomised phase III study: 5-year analysis of the EORTC-NCIC trial. Lancet Oncol 10(5):459–466. https://doi.org/10.1016/S1470-2045(09)70025-7
3. Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O’Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN (2010) Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17(1):98–110. https://doi.org/10.1016/j.ccr.2009.12.020
4. Brennan C, Momota H, Hambardzumyan D, Ozawa T, Tandon A, Pedraza A, Holland E (2009) Glioblastoma subclasses can be defined by activity among signal transduction pathways and associated genomic alterations. PLoS One 4(11):e7752. https://doi.org/10.1371/journal.pone.0007752
5. Phillips HS, Kharbanda S, Chen R, Forrest WF, Soriano RH, Wu TD, Misra A, Nigro JM, Colman H, Soroceanu L, Williams PM, Modrusan Z, Feuerstein BG, Aldape K (2006) Molecular subclasses of high-grade glioma predict prognosis, delineate a pattern of disease progression, and resemble stages in neurogenesis. Cancer Cell 9(3):157–173. https://doi.org/10.1016/j.ccr.2006.02.019
6. Network TCGAR (2008) Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455(7216):1061–1068
7. Krakstad C, Chekenya M (2010) Survival signalling and apoptosis resistance in glioblastomas: opportunities for targeted therapeutics. Mol Cancer 9:135. https://doi.org/10.1186/1476-4598-9-135
8. Johnson H, White FM (2014) Quantitative analysis of signaling networks across differentially embedded tumors highlights interpatient heterogeneity in human glioblastoma. J Proteome Res 13(11):4581–4593. https://doi.org/10.1021/pr500418w
9. Rikova K, Guo A, Zeng Q, Possemato A, Yu J, Haack H, Nardone J, Lee K, Reeves C, Li Y, Hu Y, Tan Z, Stokes M, Sullivan L, Mitchell J, Wetzel R, Macneill J, Ren JM, Yuan J, Bakalarski CE, Villen J, Kornhauser JM, Smith B, Li D, Zhou X, Gygi SP, Gu TL, Polakiewicz RD, Rush J, Comb MJ (2007) Global survey of phosphotyrosine signaling identifies oncogenic kinases in lung cancer. Cell 131(6):1190–1203. https://doi.org/10.1016/j.cell.2007.11.025
10. Drake JM, Graham NA, Lee JK, Stoyanova T, Faltermeier CM, Sud S, Titz B, Huang J, Pienta KJ, Graeber TG, Witte ON (2013) Metastatic castration-resistant prostate cancer reveals intrapatient similarity and interpatient heterogeneity of therapeutic kinase targets. Proc Natl Acad Sci U S A 110(49):E4762–E4769. https://doi.org/10.1073/pnas.1319948110
11. Steu S, Baucamp M, von Dach G, Bawohl M, Dettwiler S, Storz M, Moch H, Schraml P (2008) A procedure for tissue freezing and processing applicable to both intra-operative frozen section diagnosis and tissue banking in surgical pathology. Virchows Arch 452(3):305–312. https://doi.org/10.1007/s00428-008-0584-y
12. Loken SD, Demetrick DJ (2005) A novel method for freezing and storing research tissue bank specimens. Hum Pathol 36(9):977–980. https://doi.org/10.1016/j.humpath.2005.06.016
13. Gajadhar AS, Johnson H, Slebos RJ, Shaddox K, Wiles K, Washington MK, Herline AJ, Levine DA, Liebler DC, White FM (2015) Phosphotyrosine signaling analysis in human tumors is confounded by systemic ischemia-driven artifacts and intra-specimen heterogeneity. Cancer Res 75(7):1495–1503. https://doi.org/10.1158/0008-5472.CAN-14-2309
14. Snuderl M, Fazlollahi L, Le LP, Nitta M, Zhelyazkova BH, Davidson CJ, Akhavanfard S, Cahill DP, Aldape KD, Betensky RA, Louis DN, Iafrate AJ (2011) Mosaic amplification of multiple receptor tyrosine kinase genes in glioblastoma. Cancer Cell 20(6):810–817. https://doi.org/10.1016/j.ccr.2011.11.005
15. Szerlip NJ, Pedraza A, Chakravarty D, Azim M, McGuire J, Fang Y, Ozawa T, Holland EC, Huse JT, Jhanwar S, Leversha MA, Mikkelsen T, Brennan CW (2012) Intratumoral heterogeneity of receptor tyrosine kinases EGFR and PDGFRA amplification in glioblastoma defines subpopulations with distinct growth factor response. Proc Natl Acad Sci U S A 109(8):3041–3046. https://doi.org/10.1073/pnas.1114033109
16. Sottoriva A, Spiteri I, Piccirillo SG, Touloumis A, Collins VP, Marioni JC, Curtis C, Watts C, Tavare S (2013) Intratumor heterogeneity in human glioblastoma reflects cancer evolutionary dynamics. Proc Natl Acad Sci U S A 110(10):4009–4014. https://doi.org/10.1073/pnas.1219747110
17. Johnson H, Del Rosario AM, Bryson BD, Schroeder MA, Sarkaria JN, White FM (2012) Molecular characterization of EGFR and EGFRvIII signaling networks in human glioblastoma tumor xenografts. Mol Cell Proteomics 11(12):1724–1740. https://doi.org/10.1074/mcp.M112.019984
18. Zhang Y, Wolf-Yadlin A, Ross PL, Pappin DJ, Rush J, Lauffenburger DA, White FM (2005) Time-resolved mass spectrometry of tyrosine phosphorylation sites in the epidermal growth factor receptor signaling network reveals dynamic modules. Mol Cell Proteomics 4(9):1240–1250. https://doi.org/10.1074/mcp.M500089-MCP200
19. Curran TG, Bryson BD, Reigelhaupt M, Johnson H, White FM (2013) Computer aided manual validation of mass spectrometry-based proteomic data. Methods 61(3):219–226. https://doi.org/10.1016/j.ymeth.2013.03.004
Part III
Systems Analysis of Cancer Cell Metabolism

Chapter 9
Prediction of Clinical Endpoints in Breast Cancer Using NMR Metabolic Profiles
Leslie R. Euceda, Tonje H. Haukaas, Tone F. Bathen,
and Guro F. Giskeødegård
Abstract
Metabolic profiles reflect biological conditions as a result of biochemical changes within a living system. It is
therefore possible to associate metabolic signatures with clinical endpoints of diseases, such as breast cancer.
Nuclear magnetic resonance (NMR) spectroscopy is one of the most common techniques used for
metabolic profiling, and produces high dimensional datasets from which meaningful biological information
can be extracted. Here, we present an overview of data analysis techniques used to achieve this, describing
key steps in the procedure. Moreover, examples of clinical endpoints of interest are provided. Although
these are specific for breast cancer, the procedures for the analysis of NMR spectra as described here are
applicable to any type of cancer and to other diseases.
Key words Breast cancer, Cross validation, Hierarchical clustering, Hypothesis testing, Metabolites,
Model diagnostic statistics, Multivariate analysis, NMR spectroscopy, Partial least squares, Principal
component analysis, Permutation testing
1 Introduction
Metabolic profiling refers to the large-scale measurement of low
molecular weight metabolites and their intermediates generated
within a living system at a given moment. Those metabolites reflect
a biological condition as a result of biochemical changes caused by
genetic modification and physiological or pathophysiological sti-
muli [1]. Metabolites include substances such as carbohydrates,
amino acids, nucleotides, fatty acids, and vitamins. The complete
set of metabolites in a living system is termed the metabolome,
comparable to the terms genome, transcriptome, and proteome.
These four molecular levels interact with each other, generating an
information exchange flow known as the omics cascade [2].
Metabolites are downstream products affected by the omics
signaling cascade (DNA, RNA, and proteins), but can also affect
upstream signaling processes, such as gene expression. They are
representative of the observed functional phenotype and provide
information on the active biological pathways.
Leslie R. Euceda and Tonje H. Haukaas contributed equally to this work.
Nuclear magnetic resonance (NMR) spectroscopy is one of the
most commonly used techniques for metabolic profiling, and can
be applied to both biofluids and intact tissue samples. For details on
the NMR theory, we refer to [3]. Briefly, NMR spectroscopy is
based on the intrinsic property of spin possessed by certain atomic
nuclei, such as protons (1H), which gives rise to a small magnetic
field. This magnetic field is called the magnetic moment of a
nucleus, and can be thought of as a vector with direction and
magnitude. When a sample is subjected to an external magnetic
field, the magnetic moments will align either in the direction of that
field or opposite to it, bringing the nuclei to a low or high energy
spin state, respectively. Energies from opposite spin states counter-
balance each other. However, the low energy spin state is slightly
more energetically favorable, and thus a higher population of nuclei
in a system will exist in this state. This generates a residual magne-
tization component parallel to the external magnetic field. In addi-
tion to nuclear spin, nuclei precess, i.e., rotate, about the external
magnetic field axis. The rate of precession, also known as the
resonance frequency or Larmor frequency, corresponds to the
energy difference between the energy spin states. By applying a
radio frequency (RF) pulse that matches the Larmor frequency of
the nuclei of interest, the nuclei will absorb the energy and transi-
tion to a higher energy state. When the RF pulse is switched off, the
nuclei recover to their original energy state, and the released energy
can be measured by receiver coils. The observed signal can be
mathematically converted to an NMR spectrum, which is a plot of
the intensity of emission as a function of the resonance frequency
(see Fig. 1).
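The dependence of the Larmor frequency on the external magnetic field can be illustrated numerically. The sketch below uses the standard relation that the resonance frequency equals the gyromagnetic ratio divided by 2π, multiplied by the field strength; the value of about 42.577 MHz/T for protons is a textbook constant, while the field strengths and function name are arbitrary examples that do not come from this chapter.

```python
GAMMA_BAR_1H_MHZ_PER_T = 42.577  # gyromagnetic ratio of 1H divided by 2*pi


def larmor_frequency_mhz(b0_tesla, gamma_bar=GAMMA_BAR_1H_MHZ_PER_T):
    """Larmor (resonance) frequency in MHz for a nucleus in a field of b0_tesla."""
    return gamma_bar * b0_tesla


if __name__ == "__main__":
    for b0 in (9.4, 14.1):  # example field strengths in tesla
        print(b0, "T:", round(larmor_frequency_mhz(b0), 1), "MHz")
```

This proportionality is why NMR magnets are commonly named by their proton frequency, for instance a "600 MHz" instrument for a field of roughly 14.1 T.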
Atomic nuclei of the same isotope experience small variations
or shifts in their resonance frequencies depending on the chemical
environment of each nucleus. This will cause the signals to appear at
different positions in the NMR spectrum; this position is termed
the chemical shift. In addition, the signals can be split in different
ways as an effect of the chemical bonds of the nuclei. Therefore, an
NMR spectrum can provide detailed information about molecular
structure of the metabolites to which the observed nuclei belong.
Moreover, the observed signal intensity is proportional to the
number of nuclei giving rise to that signal [4], and can be exploited
for quantification purposes.
Fig. 1 Example 1H NMR spectrum of breast tumor tissue. Observable metabolite peaks include glucose (Glc),
ascorbate (Asc), lactate (Lac), myo-inositol (mI), creatine (Cr), glutamate (Glu), glycine (Gly), taurine (Tau),
glycerophosphocholine (GPC), phosphocholine (PCho), choline (Cho), glutathione (GSH), glutamine (Gln),
succinate (Succ), and alanine (Ala)
Because the resonance frequency of a nucleus is proportional to
the external magnetic field, it can be calibrated relative to the
resonance frequency of a signal from a reference at the applied
magnetic field. This makes NMR spectra comparable independent
of the strength of the magnet employed. Since differences in
resonance frequencies are very small, the chemical shift scale is
expressed in parts per million (ppm). Trimethylsilyl propionic acid
(TSP) is a common reference compound that is typically calibrated
to 0 ppm [5].
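The field independence of the ppm scale can be made concrete with a small calculation. The sketch below uses the usual definition of chemical shift as the frequency difference from the reference divided by the reference frequency, scaled by one million; the frequencies are hypothetical round numbers chosen for illustration, and the function name is an assumption.

```python
def chemical_shift_ppm(signal_hz, reference_hz):
    """Chemical shift in ppm relative to a reference signal (e.g., TSP at 0 ppm)."""
    return (signal_hz - reference_hz) / reference_hz * 1e6


if __name__ == "__main__":
    # The same nucleus observed on two hypothetical instruments:
    # 2400 Hz from the reference at 600 MHz, 1600 Hz from it at 400 MHz.
    print(chemical_shift_ppm(600_002_400.0, 600_000_000.0))
    print(chemical_shift_ppm(400_001_600.0, 400_000_000.0))
```

Although the offset in hertz differs between the two magnets, the shift in ppm is the same, which is what allows spectra acquired at different field strengths to be compared.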
Sample preparation for both liquid- and solid-state NMR is
simple and non-destructive. High resolution (HR) magnetic reso-
nance spectroscopy (MRS) of biofluids, cell extracts, and culture
media is suited for high-throughput analysis and can typically
detect 20–70 metabolites. For a description of procedures for
recording metabolic profiles of liquid solutions using 1H NMR,
we refer to [6].
In NMR, anisotropic interactions between nuclei are those that
are dependent on the direction of molecules with respect to the
external magnetic field. In solution NMR, these interactions are
averaged out due to high molecular mobility. In solids and semi-
solids, such as tissue samples, molecular motion is restricted and so
anisotropic interactions can give rise to peak broadening that may
ultimately lead to signal overlap. It is possible to overcome this by
rapidly spinning the sample at an angle of 54.7°, known as the
magic angle, which imitates molecular motion in liquid solution.
This method, called high resolution (HR) magic angle spinning
(MAS) MRS, yields highly resolved spectra of tissue comparable
with those obtained for biofluids using conventional MRS. For a
detailed protocol describing HR MAS MRS of intact tissue, we
refer to [7].
NMR acquisition of metabolic profiles results in complex, high-dimensional
datasets. Prior to statistical analysis, biologically irrele-
vant differences caused for example by instrumental or experimen-
tal artifacts must be removed from the raw data. This is known as
data preprocessing and involves different computational proce-
dures to convert the acquired data into a format that is usable to
extract meaningful information. Because high intensity lipid peaks
arising from normal breast adipose cells are often present in spectra
from breast tumor tissue, spectral preprocessing may be challeng-
ing and is seldom straightforward [8]. Careful inspection of the
preprocessed spectra is therefore essential to evaluate each individ-
ual preprocessing step. Modifications to the original preprocessing
strategy are often required. For an overview of frequently used
preprocessing tools and a general metabolomics workflow, we
refer to [9], and for a description of specific preprocessing methods
we refer to [10].
In this chapter, we provide an overview of data analysis strate-
gies used to associate metabolic signatures with clinical endpoints
of diseases, with a focus on breast cancer. Additionally, key steps in
those data analysis strategies are described. We furthermore provide
guidance for the interpretation of results.
2 Materials
2.1 Data Input
Choosing the optimal approach for statistical analysis is dependent
on the type and structure of the data input as well as the hypothesis
of interest. The NMR spectral data should, prior to statistical
analysis, have been through proper preprocessing procedures. For
multivariate analysis, the preprocessed data forms the X-matrix,
where each row represents one sample and each column represents
one variable or point in an NMR spectrum. Alternatively, metabo-
lites can be quantified to make the data applicable for both multi-
variate and univariate analysis. In such cases, quantified metabolites
from the same sample can be combined so that each variable of the
X-matrix used for multivariate analysis represents one individual
metabolite.
The Y-matrix or vector, used in supervised analysis, contains
information about the clinical endpoint that should be predicted.
The clinical endpoint is defined as the relevant patient information
of interest to test for correlation with metabolic signature. Exam-
ples of clinical outcome variables of interest in breast cancer are
patient 5-year survival, tumor size, tumor cell percentage, lymph
node status, metastatic status, pathological response to treatment,
and hormone (estrogen or progesterone) receptor status. These can
either be categorical (e.g., lymph node status) or continuous vari-
ables (e.g., tumor cell percentage) (see Note 1).
In summary, the different data input structures are:

X-matrix:
– Preprocessed NMR spectra.
– Relative or absolute metabolite concentrations.
Y-matrix/vector:
– Prediction analysis: Categorical clinical endpoint.
– Regression analysis: Continuous clinical endpoint.
2.2 Software
Several software packages are available for univariate or
multivariate analysis of metabolomics data, differing in their flexi-
bility and user-friendliness (see Table 1). Software such as Matlab
and R can be used for all types of data analysis, but requires
programming knowledge.
3 Methods
3.1 Multivariate Analysis
3.1.1 Unsupervised Methods
Unsupervised methods are exploratory and useful tools for getting
to know your dataset in terms of possible groupings, patterns, and
outliers, without taking a response variable into account. Examples
of common methods are principal component analysis (PCA) and
hierarchical cluster analysis (HCA).
Principal Component Analysis (PCA)
PCA is a method that, through linear combinations of the original
independent variables X, constructs a new lower-dimensional coordinate
system made up of latent variables (LVs), which in PCA are
called principal components (PCs) [11]. These variables explain
variance within the dataset with the aim of capturing the main
trends in the data. The position of each sample in the new coordinate
system is reflected by the scores matrix (T), while the influence
of the original variables on the PCs is defined by the loadings matrix
(P) such that:
$$X = TP^{T} + E \qquad (1)$$
where E is the residual matrix of variance not explained by the
model, and the superscript T indicates the transpose of a matrix. The results can
thus be visualized in scores and loadings plots (see Fig. 2).
Protocol for PCA
1. Additional preprocessing of variables. Although the spectral data
was preprocessed prior to data analysis, PCA is sensitive to the
scaling of the variables.
Spectral data: Mean center the data by subtracting the variable
mean from each variable value to make the mean zero. Mean
centering of spectral data removes the offset from each variable
so that PC1 will not capture the mean of the data but the
direction of maximum variance.
Quantified metabolites: Autoscale the metabolite concentrations
by normalizing each value to the variable standard deviation
after mean centering. Autoscaling allows each variable to have
the same influence on the model, and the resulting variables have
mean zero and standard deviation of one (see Note 2).
2. Perform PCA using the software of choice (see Table 1).
3. Select the number of components to include in the model. There are
two alternative approaches:
Cumulative variance plot: Evaluate the cumulative variance
explained by the model with increasing number of PCs. Choose
the number of PCs that explain a certain predetermined amount
of variance (see Note 3).
Scree plot: Plot the variance explained by each PC (see Note 4).
The variance will decrease for each PC. Choose the PC repre-
senting the “knee” in the curve (see Fig. 3).
4. Review the scores plot to look for possible groupings, e.g., of
clinical outcome, or outliers in your dataset.
Table 1
Examples of available software and interfaces to perform multivariate and/or univariate
metabolomics analyses described here
| Software/Interface | Reference/URL | Methods implemented |
|---|---|---|
| Amix | https://www.bruker.com/products/mr/nmr/nmr-software/software/amix/overview.html | Multivariate |
| Knime (a) | https://www.knime.org/knime | Univariate and multivariate |
| Matlab | http://www.mathworks.com/products/matlab/ | Univariate and multivariate |
| MetaboAnalyst (a) | http://www.metaboanalyst.ca/faces/docs/Format.xhtml | Univariate and multivariate |
| PLS toolbox (b) | http://www.eigenvector.com/software/pls_toolbox.htm | Multivariate |
| R | https://www.r-project.org/ | Univariate and multivariate |
| SIMCA | http://umetrics.com/products/simca | Multivariate |
| SIRIUS | http://www.prs.no/Sirius/Sirius.html | Multivariate |
| SPSS | http://www-01.ibm.com/software/analytics/spss/products/statistics/ | Univariate |
| STATA | http://www.stata.com/ | Univariate and multivariate |
| The Unscrambler | http://www.camo.com/rt/Products/Unscrambler/unscrambler.html | Univariate and multivariate |
(a) Uses R packages
(b) Requires Matlab
Fig. 2 Result from principal component analysis of breast cancer tissue from two different groups. The two
classes are perfectly separated in the second principal component (PC2). Samples from class 2 have high PC2
scores compared to class 1, thus they have higher levels of the metabolite phosphocholine (PCho) and lower
levels of glycerophosphocholine (GPC) compared to the class 1 samples. The first principal component (PC1)
shows that the largest variation in the dataset is due to differences in lipid concentrations between the
samples, as the samples to the right in the scores plot, having high scores on PC1, have high lipid levels
compared to the remaining samples
Fig. 3 Scree plot example. Two “knees,” marked by red arrows, are observed, suggesting two or four principal
components (PCs) to be the optimal number. In this case, the cumulative variance plot can aid in the
determination of the best “knee,” selecting the one that represents the number of PCs that explains a
certain predetermined amount of variance
Fig. 4 Dendrogram example. Samples, whose ID numbers are specified on the x-axis, are divided into six
clusters, shown in different colors, by manually setting a cutoff at height 150
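The PCA protocol can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used by any package in Table 1: the data are random placeholders, the function name pca is hypothetical, and the 80% cumulative variance threshold is an arbitrary choice made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))        # placeholder data: 20 samples x 6 variables
X[:, 0] *= 10                       # one variable on a much larger scale

def pca(X, autoscale=False):
    """PCA via SVD; returns preprocessed data, scores T, loadings P, explained variance."""
    Xc = X - X.mean(axis=0)                      # mean centering (spectral data)
    if autoscale:
        Xc = Xc / X.std(axis=0, ddof=1)          # autoscaling (quantified metabolites)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U * s                                    # scores matrix
    P = Vt.T                                     # loadings matrix
    explained = s**2 / np.sum(s**2)              # fraction of variance per PC (scree plot)
    return Xc, T, P, explained

Xc, T, P, explained = pca(X, autoscale=True)
cumulative = np.cumsum(explained)                # cumulative variance plot
threshold = 0.80                                 # assumed, predetermined amount of variance
n_pcs = int(np.searchsorted(cumulative, threshold) + 1)
E = Xc - T[:, :n_pcs] @ P[:, :n_pcs].T           # residual matrix E in X = TP^T + E
print(n_pcs, np.round(explained, 3))
```

Plotting the first two columns of T against each other gives the scores plot reviewed in step 4, and the corresponding columns of P give the loadings plot.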
Hierarchical Cluster Analysis (HCA)
HCA aims to find natural clusters among samples using a hierarchical
approach where samples are grouped according to calculated
similarities and dissimilarities. The result is visualized as a dendrogram
(see Fig. 4). At the bottom of the dendrogram, all objects
represent individual clusters. For each level, the two closest objects
are joined into one cluster. This continues until all clusters are
joined by one branch. There are different measures for determining
the distance between individual samples or between clusters of
samples. Common measures for individual samples include Euclidean
distance, Manhattan distance, and sample correlations. Common
measures for distance between clusters include single linkage,
average linkage, complete linkage, and Ward’s method. The
procedure is done automatically by most software. The steps for the
HCA algorithm are listed below.
Protocol for HCA
1. Calculate the distance between all possible pairs of clusters using
the chosen distance measure for individual samples.
2. Merge the two clusters with the smallest distance.
3. Calculate the new clusters’ distance to other clusters using a
chosen distance measure for clusters.
4. Repeat steps 2–3 until all samples are merged into one cluster.
Alternatively, a top down approach can be used where all
objects are considered one cluster initially and subsequently divided
into smaller clusters depending on their dissimilarities.
To decide which metrics are optimal for distance measurement
and to assess how well the dendrogram reflects your data,
the cophenetic correlation coefficient [12] can be used. This coefficient
is the correlation between the original pairwise distance
between two objects and the level/height at which the two objects
were joined in one cluster.
The resulting dendrogram can be used to divide the samples
into clusters. The number of resulting clusters can be defined
beforehand or a cutoff can be set at a decided level of the dendrogram,
either manually or using for instance the gap statistic [13]. All
the samples joined by branches below the cutoff are considered one
cluster. The resulting clusters can be evaluated in terms of clinical
endpoints of interest. Prediction of cluster labels for new samples
can be achieved based on the shortest distance to each of the cluster
centroids or using validated supervised models (e.g., PLS-DA, see
Subheading 3.1.2).
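As an illustration of the HCA workflow, the sketch below uses SciPy's hierarchical clustering functions on placeholder data. The two simulated groups, the choice of Euclidean distance, and the decision to cut the tree into two clusters are assumptions made for the example only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Placeholder data: two well separated groups of 10 samples x 5 variables
X = np.vstack([rng.normal(0, 1, (10, 5)), rng.normal(8, 1, (10, 5))])

d = pdist(X, metric="euclidean")            # pairwise distances between samples
results = {}
for method in ("single", "average", "complete", "ward"):
    Z = linkage(d, method=method)            # agglomerative merging (protocol steps 1-4)
    c, _ = cophenet(Z, d)                    # cophenetic correlation coefficient
    results[method] = (Z, c)

# Keep the linkage whose dendrogram best preserves the original distances
best = max(results, key=lambda m: results[m][1])
Z_best = results[best][0]
labels = fcluster(Z_best, t=2, criterion="maxclust")   # cut into a predefined number of clusters
print(best, round(results[best][1], 3), labels)
```

A height-based cutoff, as in Fig. 4, would use criterion="distance" with t set to the chosen height instead.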
3.1.2 Supervised Methods
Supervised multivariate methods are used to identify correlations
and build models that can predict characteristics of new data. These
methods model the relationship between the independent variables
or X-matrix (e.g., spectral data or quantified metabolites), and a
response variable(s) or Y-matrix/vector (e.g., clinical endpoints) by
identifying patterns in the input data and making decision rules that
can be applied to new data. Several supervised analysis methods
exist, some of which build models by linear combinations of the
input variables (e.g., partial least squares methods), while others
model more complex, nonlinear relationships (e.g., neural net-
works and support vector machines). An advantage of linear models
is that biological interpretation is more easily achievable.
1. In case of continuous response variables such as blood pressure
or tumor size, a supervised regression method that predicts a
response value should be used.
2. For categorical response variables (e.g., treatment and control
groups), a classification method that can assign samples to two
or more groups should be used.
A common feature of supervised analysis methods is that these
methods are prone to overfitting to the input data [14]. This would
result in a model that makes good predictions for the data used, but
predicts badly for new data. To avoid overfitted models and over-
optimistic results, validation is a crucial step for supervised analysis
methods. Validation procedures are described in Subheading 3.1.4.
Partial Least Squares (PLS) Methods
Partial least squares (PLS) is commonly used both for regression
and for classification problems. PLS defines underlying structures
that maximize the covariance between the independent variables
and the response variable [15]. Instead of only modeling the independent
variables X, as is the case for PCA, the dependent response
variables Y are also modeled:
$$X = TP^{T} + E \qquad (2)$$
$$Y = UQ^{T} + F \qquad (3)$$
where T and U are the score matrices, P and Q are the loading
matrices, and E and F are the residuals for X and Y, respectively. The
superscript T indicates the transpose of a matrix. The X-scores, T, are predictors
of Y and will also model X, thus both X and Y are assumed to be
modeled by the same latent variables. Hence, Y can be written as
$$Y = TGQ^{T} + F \qquad (4)$$
where G is the diagonal matrix resulting from U = TG.
There are several algorithms that can be used to estimate these
parameters, all of which provide more or less similar results [16].
The covariance between X and Y is optimized by defining PLS
LVs, which are linear combinations of the original X variables, and
the dimensionality of the resulting PLS model is equal to the
number of LVs used in the model. The optimal number of LVs to
use is chosen based on different model diagnostic terms used to
evaluate the overall quality of the model for different numbers of
LVs. For PLSR, the Q2 statistic (Eq. 5) and the root mean square
error (RMSE) (Eq. 6) are typically used. These statistics reflect the
differences between the predicted value ($\hat{y}$) and the known y:
$$Q^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \qquad (5)$$
$$\mathrm{RMSE} = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{m}} \qquad (6)$$
where i = 1, 2, ..., m for m included samples and $\bar{y}$ is the mean of all y
values. If RMSE is to be compared between datasets, the value
should be normalized (NRMSE) to make it independent of scale
(Eq. 7):
$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{y_{\max} - y_{\min}} \qquad (7)$$
A Q2 of 1 corresponds to perfect prediction, while a very low or
negative Q2 indicates a poor prediction. An NRMSE value of 0 corresponds
to perfect prediction, while a value close to 1 indicates
poor prediction.
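The three regression diagnostics can be written directly from their definitions. The following is a small sketch with hypothetical function names and made-up values; in practice the predicted values would come from cross validation.

```python
import numpy as np

def q2(y, y_hat):
    """Q2 statistic: one minus residual sum of squares over total sum of squares."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def rmse(y, y_hat):
    """Root mean square error over the m included samples."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def nrmse(y, y_hat):
    """RMSE normalized by the range of the known y values."""
    y = np.asarray(y, float)
    return rmse(y, y_hat) / (y.max() - y.min())

y_known = [1.0, 2.0, 3.0, 4.0]      # made-up response values
y_pred = [1.0, 2.0, 3.0, 5.0]       # made-up predictions, one of them off by 1
print(q2(y_known, y_pred), rmse(y_known, y_pred), nrmse(y_known, y_pred))
```

For these values the residual sum of squares is 1 and the total sum of squares is 5, so Q2 is 0.8, RMSE is 0.5, and NRMSE is 0.5/3.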
For PLS-DA, commonly used diagnostic statistics include the
classification error, sensitivity, and specificity. The number of cor-
rectly classified samples, i.e., true positives (TP) and true negatives
(TN), and the number of incorrectly classified samples, i.e., false
negatives (FN) and false positives (FP), is subsequently recorded.
The prediction error (see Note 5) relates the number of incorrectly
classified samples with the total number of samples (Eq. 8). The
model accuracy equals one minus the classification error.
$$\mathrm{Error} = \frac{\text{Number of incorrectly classified samples}}{\text{Total number of samples}} = \frac{FN + FP}{TP + TN + FN + FP} \qquad (8)$$
Sensitivity measures the ability to correctly predict the case
class, or true positive samples (Eq. 9). A highly sensitive model is
one that generates few false negatives.
$$\mathrm{Sensitivity} = \frac{\text{Number of correctly classified cases}}{\text{Total number of cases}} = \frac{TP}{TP + FN} \qquad (9)$$
Specificity measures the ability to correctly classify the control
class, or true negative samples (Eq. 10). A highly specific model is
one that generates few false positives.
$$\mathrm{Specificity} = \frac{\text{Number of correctly classified controls}}{\text{Total number of controls}} = \frac{TN}{TN + FP} \qquad (10)$$
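The classification diagnostics in Eqs. 8 to 10 can be computed from the four counts as follows. The function name and the label coding (1 for cases, 0 for controls) are assumptions for this example, and the labels are made up.

```python
def classification_diagnostics(y_true, y_pred, positive=1):
    """Classification error, sensitivity and specificity from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "error": (fn + fp) / (tp + tn + fn + fp),   # Eq. 8
        "sensitivity": tp / (tp + fn),              # Eq. 9
        "specificity": tn / (tn + fp),              # Eq. 10
    }

# Made-up example: 4 cases (1) and 6 controls (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
stats = classification_diagnostics(y_true, y_pred)
print(stats, "accuracy:", 1 - stats["error"])
```

Here TP = 3, FN = 1, TN = 4, and FP = 2, giving an error of 0.3, a sensitivity of 0.75, and a specificity of 4/6.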
The scores and loadings are calculated from the LVs, and are
used for biological interpretation of the data in similar manners as
for PCA.
Partial Least Squares Regression (PLSR)
PLSR is used for regression problems where the intention is to
model correlations and/or make predictions of a continuous
response variable.
Protocol for PLSR
1. Perform additional scaling of variables as described in the Protocol
for PCA, step 1.
2. Perform PLSR using the spectral data or quantified metabolites
as independent variables X and the sample characteristic to be
modeled as a continuous response variable Y. Depending on the
number of samples in your dataset, choose the type of cross-
validation that suits your dataset (see Subheading 3.1.4). Make
PLSR models for a restricted number of LVs (see Note 6).
3. Examine the cross-validated RMSE or Q2 for the different num-
bers of LVs.
4. Choose the number of LVs giving the first minimum in cross-
validated RMSE or the first maximum in Q2 (see Note 7).
5. If you have an independent validation set, make predictions of
the independent test set using the model obtained in step 4.
Partial Least Squares Discriminant Analysis (PLS-DA)
PLS-DA is used for classification problems where the intention is to
model correlations and/or make predictions that discriminate between two or more
groups of samples.
Protocol for PLS-DA
1. Perform additional scaling of variables as described in the Protocol
for PCA, step 1.
2. Perform PLS-DA using the spectral data or quantified metabo-
lites as independent variables X and the sample characteristic to
be modeled as a categorical response variable Y, with discrete
numbers representing each class (e.g., 1 for treatment group and
2 for control group). Depending on the number of samples in
your dataset, choose the type of cross-validation that suits your
dataset (see Subheading 3.1.4). Make PLS-DA models for a
restricted number of LVs (see Note 6).
3. Examine the cross-validated classification error for the different
numbers of LVs.
4. Choose the number of LVs giving the first minimum in cross-
validated classification error (see Note 7).
5. If you have an independent validation set, make predictions of
the independent test set using the model obtained in step 4.
Orthogonal PLS (OPLS)
In orthogonal PLSR (OPLSR) and OPLS-DA, variation in X that is
orthogonal to the response is separated out before the model is
built. Orthogonalizing the model gives identical model performance
to that of original PLS as the OPLS components are
rotations of the original ones. However, interpretations are easier as
all relevant information will be present in the first LV, hence only
the first score and loading vector will be used for interpretations.
Multilevel PLS-DA
For longitudinal or cross-over studies, where each individual serves
as its own control, multilevel PLS-DA can be used to separate the
within-patient variation from the between-patient variation
[17]. By focusing on the within-patient variation, representing
metabolic changes due to the intervention (e.g., samples before
and after treatment), metabolic changes that would otherwise be
masked by the often much larger between-patient variation can be
revealed. In the example of samples of the same individuals before
and after intervention, the within-patient variation would be sepa-
rated according to:
$$\mathrm{Control} = A - B \qquad (11)$$
$$\mathrm{Intervention} = B - A \qquad (12)$$
where A is the metabolic data before intervention and B is the
metabolic data after intervention. These new matrices from
Eqs. 11 and 12 are concatenated and used as independent variables
in PLS-DA with a categorical variable representing control and
intervention as the Y vector.
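Building the input for multilevel PLS-DA amounts to a few array operations. The sketch below uses simulated paired data in which the rows of A and B are assumed to be in the same patient order; the variable names and effect sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n_patients, n_vars = 8, 5
patient_offset = rng.normal(scale=5.0, size=(n_patients, 1))       # large between-patient variation
A = patient_offset + rng.normal(size=(n_patients, n_vars))         # before intervention
B = A + 0.5 + rng.normal(scale=0.1, size=(n_patients, n_vars))     # after intervention

control = A - B                                   # Eq. 11
intervention = B - A                              # Eq. 12
X_within = np.vstack([control, intervention])     # concatenated within-patient variation
y = np.array(["control"] * n_patients + ["intervention"] * n_patients)

print(X_within.shape, y.shape)
print(np.round(X_within.std(), 2), np.round(A.std(), 2))
```

The patient-specific offsets cancel in the differences, so the spread of X_within is much smaller than that of A, and X_within together with y can be passed to a PLS-DA routine.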
Other Multivariate Analysis Methods
In addition to PLS-based methods, several other multivariate analysis
methods are suitable for the analysis of breast cancer metabo-
relationships between the input variables and the problem to be
solved, and can be used for both regression and classification pro-
blems [18]. NNs consist of three or more layers: an input layer, one
or several hidden layers, and an output layer. The nodes of each
layer are connected through weights, and the weights and hidden
layer(s) will be adapted to the input data through learning. Another
suitable method is support vector machines (SVMs) [19]. SVMs do
not learn like NNs, but instead aim to find boundaries that separate
different groups. The boundary determined by SVMs will be a line
in 2D, a plane in 3D, or a hyperplane in n dimensions. By choosing
different kernel functions, SVMs can be applied to nonlinear pro-
blems by transformation of the input space into a higher dimension
where the classes are linearly separable. Although these methods
can be powerful for making predictions, a main drawback of NNs
and SVMs is the difficulty in interpreting the resulting models.
3.1.3 Variable Selection
Metabolic datasets are made up of several variables, or columns,
each one representing a point in a spectrum or an individual metabolite.
Variables that are biologically irrelevant add noise to the
model and can impair model performance. A variable selection
procedure is therefore often performed when analyzing
metabolomics data. Although variable selection can be carried out
using several methods that evaluate individual variables or subsets
of them at a time, all of them use information obtained from
statistical modeling (e.g., variable importance in projection [20],
prediction error [21], selectivity ratio [22]) as a variable importance
measure. It is therefore important to perform suitable validation of
the selected variables (see Subheading 3.1.4).
Protocol for Variable Selection
1. Define the variable selection method to use. For an overview of
available methods, see Ref. 23.
2. Perform variable selection using the algorithm or script defined.
3. Build models using only the selected variables.
It is worth mentioning that if many variables provide the same
information, only one (or very few) of these will be selected to
minimize redundancy. Highly correlated metabolites involved in
the same pathway could thus be discarded while still being biologi-
cally important.
3.1.4 Validation
Supervised multivariate models (see Subheading 3.1.2) tend to
overfit the data, which means they may capture even the noise of
the dataset used for building, or training, the model. If this occurs,
the model will perform well only for the training samples, and not
be applicable to new, similar samples [14]. Proper validation is
therefore essential to assess model quality and robustness. A second
purpose for validation is to optimize model parameters, such as
optimal dimensionality, e.g., the number of LVs in PLS or the
number of hidden layers in neural networks.
Cross Validation
Multivariate model validation is typically performed using training
and test data. The test set is a group of samples with known
independent and dependent variable values not used for model
building, while the remaining data constitutes the training set
used for modeling. The training and test set should be representa-
tive of each other, i.e., sampled from a similar population and
handled identically. The resampling procedure known as cross vali-
dation [24] is commonly used for defining training and test data,
building models, and if applicable, optimizing model parameters.
Using the test sets generated with cross validation (CV) to simulta-
neously assess the quality of the model and to determine the
optimal dimensionality (e.g., number of LVs for PLS) is prone to
produce an over-optimistic error as the validation for both purposes
should be independent of each other. It is therefore preferable to
use a completely independent validation set comprised of a large
group of new samples for final validation. However, due to, e.g.,
budget, technical, or time constraints, the number of available
samples is often not sufficient to do this, and the model performance
will be based on the cross-validation results.
Protocol for Cross Validation
1. Divide your data into k subsets.
2. Exclude the first subset as a test set, and make a multivariate
model of the remaining data (the training data).
3. Test the performance of the resulting model on the test set.
4. Repeat steps 2–3 for all k subsets.
5. Assess the mean model diagnostic statistic (e.g., RMSE or pre-
diction error) for all k test sets. If applicable determine the
correct dimensionality or other parameters of the model as
described in the Protocols for PLSR and PLS-DA, steps 2–4.
6. If an independent validation set is present, build a model on all
data from step 1, using the dimensionality or parameters deter-
mined in step 5, and test the model using the independent
validation set.
The above procedure is designated k-fold CV, according to the
number of subsets, or folds (k) defined. Defining k to be the total
number of samples results in leave-one-out (LOO) CV. This pro-
cedure is particularly useful for datasets with low sample number
(see Note 8).
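The k-fold procedure can be written generically so that any model can be plugged in. In the sketch below, ordinary least squares is used only as a simple placeholder for a multivariate model such as PLS; the function names and the simulated data are assumptions for the example.

```python
import numpy as np

def k_fold_indices(n_samples, k, rng):
    """Step 1: shuffle the sample indices and divide them into k subsets."""
    return np.array_split(rng.permutation(n_samples), k)

def k_fold_cv(X, y, k, fit, predict, rng):
    """Steps 2-5: hold out each subset once and return the mean test RMSE."""
    rmse_per_fold = []
    for test_idx in k_fold_indices(len(y), k, rng):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        model = fit(X[train_idx], y[train_idx])          # model built on training data only
        y_hat = predict(model, X[test_idx])              # performance on the test set
        rmse_per_fold.append(np.sqrt(np.mean((y[test_idx] - y_hat) ** 2)))
    return float(np.mean(rmse_per_fold))

def fit_ols(X, y):
    """Placeholder model: least squares with an intercept."""
    coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(X)), X]), y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=30)

cv_rmse = k_fold_cv(X, y, k=5, fit=fit_ols, predict=predict_ols, rng=rng)
loo_rmse = k_fold_cv(X, y, k=len(y), fit=fit_ols, predict=predict_ols, rng=rng)  # leave-one-out
print(round(cv_rmse, 3), round(loo_rmse, 3))
```

Setting k equal to the number of samples gives the leave-one-out variant without any change to the function.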
As an alternative to defining a number of folds to divide the data
into, a fixed number or percentage of samples to be left out from
model building can be defined for each of n validation rounds. By using
random sampling, a different total error will be obtained if the
CV procedure is repeated, since the test sets will vary at random
with each repetition. This provides a less biased validation result.
As an alternative to using an independent validation set, a double CV
procedure should be employed when there is a sufficient number
of samples. The procedure includes an inner loop nested in an outer
loop for separate model parameter optimization and model quality
assessment (see Fig. 5) and is carried out as follows:
Protocol for Double CV
1. Divide your data into k subsets.
2. Exclude the first subset as a validation set, and use the remaining
data as input for the inner CV loop.
3. Perform the inner CV loop as steps 1 through 5 described for a
typical single-layered k-fold CV above (see Note 9).
4. Build a model on all data inputted in the inner CV loop, using
the dimensionality or parameters determined in step 3, and test
the model using the excluded validation set in step 2.
5. Repeat steps 2–4 for all k subsets.
6. Assess the mean model diagnostic statistic (e.g., RMSE or pre-
diction error) for all k validation sets.
Fig. 5 Illustration of data splitting for a four-fold double cross validation procedure through which the number
of latent variables (LVs) is optimized in the inner loop and model quality is assessed in the outer loop. Samples
are divided into four different outer loop groups or folds (k = 1–4). At each outer loop repetition, three folds
comprise the data input for the inner loop, while one is left out as a validation set. The inner loop is then
partitioned into four inner loop folds (k2 = 1–4), which at each inner loop repetition alternate the role of test
set while the remaining folds comprise the training set. The samples comprising the validation set in the outer
loop are therefore unseen to the latent variable optimization procedure, reducing the risk of over-optimistic
results when using them to assess the model built with the inner loop data. A classical, single-layered CV
procedure consists only of the outer loop, with the inner loop data being the training set
Model quality assessment employing the described double CV
procedure is thus performed using validation samples completely
unseen during LV optimization, reducing the risk of overfitting.
Double CV is usually not available for direct implementation in
data analysis software without the creation of in-house scripts being
required. In most multivariate analysis software, however, the typical
single-layered k-fold CV previously described can be performed
using ready-made scripts or graphical user interfaces (GUIs)
which provide the option to use an independent validation set to
assess the quality of the cross-validated model. It is important to
emphasize that when optimizing model parameters using single-
layered CV, unless model quality is assessed using new samples,
diagnostic statistics obtained cannot be considered reliable.
Permutation Testing
To ensure that obtained model diagnostic statistics are significantly
better than those that would be obtained by chance, permutation
testing can be performed. By rearranging the y response variable in
a random order, the y continuous values or classes are no longer
associated with their true corresponding metabolic information
(X); thus, any relationships between X and y are lost [24]. The
procedure can be performed to evaluate a double CV procedure as
follows.
Protocol for Permutation Testing
1. Permute or rearrange the values in the original y variable in a
random order to obtain ypermuted. Replace the original y variable
with ypermuted.
2. Perform steps 1 through 5 described for a typical single-layered
k-fold CV above using the optimized parameter value determined
for the model being tested.
3. Repeat steps 1–2 a defined number of times (nperm) (see Note
10). A total of nperm errors from different permuted models will
be obtained.
4. Count the number of permuted models achieving an equal or
better diagnostic statistic than the unpermuted model being
tested and define it as a (e.g., errornonperm ≥ errorperm).
5. Calculate a p-value as: a/nperm (see Note 11).
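The permutation protocol can be illustrated with a compact example. To keep the sketch self-contained, a nearest-centroid classifier stands in for the multivariate model (it is not PLS-DA); the data are simulated, and the choice of 200 permutations is arbitrary.

```python
import numpy as np

def cv_error(X, y, k, rng):
    """k-fold CV classification error of a nearest-centroid classifier (placeholder model)."""
    n_wrong = 0
    for test_idx in np.array_split(rng.permutation(len(y)), k):
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        classes = np.unique(y[train])
        centroids = np.array([X[train & (y == c)].mean(axis=0) for c in classes])
        dist = np.linalg.norm(X[test_idx, None, :] - centroids[None, :, :], axis=2)
        n_wrong += int(np.sum(classes[dist.argmin(axis=1)] != y[test_idx]))
    return n_wrong / len(y)

rng = np.random.default_rng(6)
y = np.repeat([1, 2], 20)
X = rng.normal(size=(40, 8))
X[y == 2, :2] += 2.0                                      # true class difference in two variables

error_nonperm = cv_error(X, y, k=5, rng=rng)              # unpermuted model
n_perm = 200
a = 0
for _ in range(n_perm):
    y_permuted = rng.permutation(y)                       # step 1
    error_perm = cv_error(X, y_permuted, k=5, rng=rng)    # step 2
    a += int(error_perm <= error_nonperm)                 # step 4: equal or better
p_value = a / n_perm                                      # step 5
print(error_nonperm, p_value)
```

With this formula the smallest nonzero p-value that can be reported is 1/n_perm, so the number of permutations limits the resolution of the test.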
3.2 Univariate Analysis
Univariate analysis can be performed to search for statistically significant
differences in individual metabolites between groups.
3.2.1 Selection Criteria for Univariate Tests
Prior to univariate analysis, it should be decided whether the data is
suited to parametric or nonparametric tests. This is decided based
on at least three check points: normality (see Note 12), homogeneity
of variances, i.e., homoscedasticity (in case of heteroscedasticity,
see Note 13), and independence of samples (for dependent samples,
e.g., repeated measurements, samples from the same hospital,
etc., linear mixed-effects models can be used (see Subheading
3.2.2)). Figure 6 shows a simplified overview of tests to select
according to data distribution and number of groups to test. For
more extensive details regarding selection criteria for univariate
tests, refer to [25].
3.2.2 Linear Mixed-Effects Model
Linear mixed-effects models (LMM) are an extension of general
linear models taking into account both fixed and random effects,
where fixed effects often are those of primary interest, e.g., effect of
treatment type, while random effects are results of random selec-
tion, e.g., age, hospital, or individual. The modeling of random
effects enables inclusion of repeated measurements. An additional
advantage is that LMM can handle missing values, thus improving
the power in multilevel analysis where some observations are
missing.
In longitudinal studies, where samples have been collected
from individuals over time, LMM can be used to evaluate which
metabolites are significantly different with respect to one or more
outcomes of interest. In such cases, metabolite levels are set as
individual response variables, clinical outcome as a fixed effect,
and patient number as a random effect.
Fig. 6 Examples of univariate tests that can be used for evaluating group differences in the level of quantified
metabolites
To perform LMM, first define the fixed and random effects.
Categorical fixed effects are set as factors. To decide whether or not
to model interactions between the fixed effects, a likelihood ratio
test comparing the reduced model (without interactions) to the full
model (with interactions) can be performed. The resulting p-value
will reflect the significance of the interaction.
1. Perform LMM using the software of choice.
2. Check that residuals are normally distributed. If not, try transforming
the data to achieve normally distributed residuals.
The resulting p-values from LMM indicate the significance of
the fixed effects after correcting for the random effect(s). Furthermore,
the LMM estimates show whether metabolites increase or decrease
with continuous fixed effects, or whether metabolites are
higher or lower in one level or group compared to another for
categorical fixed effects.
3.2.3 Multiple Testing Correction
When performing tests to associate a p-value to each individual
metabolite separately, such as those described in Subheadings
3.2.1 and 3.2.2, the same test is repeated for all metabolites. The
likelihood of significant p-values being achieved by chance will
increase with the number of tests performed. Hence, the number
of false positives (i.e., type I errors) should be controlled for. Here
lies the purpose of multiple testing corrections, which can be
achieved via different approaches. A widely used approach is the
Bonferroni adjustment [26].
Protocol for Bonferroni Adjustment
1. Generate p-values for all n metabolites using a suitable statistical test.
2. Multiply each p-value by n (adjusted values above 1 are set to 1).
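The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `bonferroni_adjust`, the four example p-values, and the 0.05 significance threshold are all hypothetical.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests n, capping the result at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


# Hypothetical raw p-values for four metabolites (illustration only).
raw_p = [0.001, 0.02, 0.04, 0.30]
adjusted = bonferroni_adjust(raw_p)

# Compare the adjusted p-values with a threshold chosen before testing.
significant = [p_adj <= 0.05 for p_adj in adjusted]
print(adjusted, significant)
```

Here only the first metabolite remains significant after adjustment, although three of the four raw p-values are below 0.05.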
The Bonferroni method controls for the family-wise error rate
(FWER), which is the probability of producing at least one false
positive. Although simple, the Bonferroni adjustment is generally
unnecessarily strict for the purposes of metabolic analyses. Alternatively,
one can implement less stringent correction methods that
control for the false discovery rate (FDR), which is the expected
proportion of false positives among the tests declared significant.
One such method is the Benjamini-Hochberg adjustment [27, 28],
whose procedure is as follows.
Protocol for Benjamini-Hochberg Adjustment:
1. Input p-values and record their order, referred to as the “original order.”
2. Rank and sort the inputted p-values in an ascending order, such
that the rank i of the smallest p-value is 1, the second smallest has
i = 2, etc.
3. Calculate an intermediate q-value ($q_{int}$) for each sorted p-value
($p_{val}$): $q_{int} = (p_{val}/i)\,n$, where n is the number of inputted
p-values.
4. Sort the $q_{int}$ values in an ascending order, recording their
corresponding p-value ranks.
5. The sorted $q_{int}$ values will now be adjusted according to their
p-value rank. The first $q_{int}$ value ($q_{int,0}$) remains the same. If the
rank of the second sorted $q_{int}$ value ($q_{int,1}$) is lower than that of
the previous value ($q_{int,0}$), overwrite $q_{int,1}$ with $q_{int,0}$. Next, look to
the rank of $q_{int,2}$. If its rank is lower than that of $q_{int,1}$, then replace $q_{int,2}$
with the new value of $q_{int,1}$; if the rank is higher, then $q_{int,2}$
remains unchanged. Next, compare the rank of $q_{int,2}$ with $q_{int,3}$
and repeat the previous steps until all $q_{int}$ values have been
adjusted. The result is a list of the final q-values (see Note 14).
6. Reorder the final q-values so that they correspond to the original
order of the inputted p-values recorded in step 1.
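As an illustration, the adjustment can be sketched in Python. Note that this sketch swaps in the commonly used step-up formulation (a running minimum of $q_{int}$ taken from the largest rank downward), which is the usual definition of the Benjamini-Hochberg adjusted p-value and may not coincide with a literal reading of step 5 in every edge case. The function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the original order of p_values."""
    n = len(p_values)
    # Steps 1-2: remember the original positions and rank the p-values ascending.
    order = sorted(range(n), key=lambda idx: p_values[idx])
    q_values = [0.0] * n
    running_min = 1.0  # q-values are capped at 1
    # Walk from the largest p-value (rank n) down to the smallest (rank 1).
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        q_int = p_values[idx] * n / rank        # step 3: intermediate q-value
        running_min = min(running_min, q_int)   # enforce monotone q-values
        q_values[idx] = running_min             # step 6: store in original order
    return q_values


# Hypothetical raw p-values for four metabolites (illustration only).
raw_p = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(raw_p)
print(q_values)
```

In this example two pairs of q-values coincide, which is the repetition described in Note 14.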
The adjusted p-value (q-value) represents the smallest FDR at
which the corresponding test will be significant. So for a q-value of
0.02, the test would be considered significant (null hypothesis
rejected) when allowing a maximum of 2% of all significant tests
to be false positives (i.e., FDR threshold is 2%). As for all statistical
tests, the desired FDR threshold value to base significance on
should be defined prior to testing.
3.3 Multivariate Versus Univariate Analysis
An overview of the key steps to analyze metabolic profiles in breast
cancer using both multivariate and univariate methods has been
provided. To conclude, a comparison of these methods regarding
their advantages and disadvantages is presented in Table 2.
Table 2
Advantages and disadvantages of univariate and multivariate methods
Univariate methods
Advantages:
• Widely used/known in all scientific fields.
• Usually simple and straightforward to perform and interpret.
• Useful for targeted approaches when one or a few metabolites have been defined to be tested.
• Variables (i.e., metabolites) do not affect the outcome of each other’s tests (with the exception of multiple testing procedures).
Disadvantages:
• Accurate measure of absolute or relative concentrations is essential.
• Untargeted approaches present challenges, particularly the risk of false discoveries increasing with increasing number of univariate paralleled tests performed. Although this can be addressed by applying multiple testing corrections, these in turn may be too strict, thereby risking to miss a true discovery.
• Does not account for variable correlation.
Multivariate methods
Advantages:
• Useful for exploratory purposes, such as outlier detection.
• Applicable for untargeted approaches as they can handle large numbers of variables and evaluate their importance, i.e., relevance to the scientific problem at hand.
• Takes proper account of the correlation between spectral points/metabolites.
• No need to correct for multiple testing, as all variables are analyzed simultaneously.
• Evidence of individual metabolites can accumulate to reveal findings that would not be detected separately with univariate methods.
• Quantification not necessary.
Disadvantages:
• Not widely known in clinical fields.
• Computationally intensive, time-consuming algorithms.
• Interpretation might not be straightforward.
• Unimportant variables that are mainly noise can obscure information from important variables that would be detected using univariate tests.
• When using the metabolic profile as input, scaling will increase the influence of the noise and might not be optimal. Thus, differences in metabolites of lower abundance may be obscured by those of higher abundance.
4 Notes
1. Refer to the following for example studies where metabolic
signature was related to specific clinical endpoints in breast
cancer: hormone receptor status and axillary lymph node status
[29] and treatment response and 5-year survival [30] studying
tissue; early recurrence [31, 32] and weight change [33] studying
serum; risk of disease development [34] studying plasma.
2. Autoscaling should not be performed on spectral data, as this
will scale up the noise regions between metabolite signals.
3. Choosing a number of PCs explaining 80–90% of the data
variation will usually be sufficient to get a good overview of
the data.
4. Alternatively, scree plots can be plotted as the eigenvalue versus
the corresponding principal component (PC). An eigenvalue
describes the amount of variance accounted for by its associated
PC [35].
5. For unbalanced datasets, i.e., those with very few samples of
one class and many of the other, the prediction error may be
misleading. For example, if 90% of samples in a dataset are of
class A, a model that predicts every sample as class A will achieve
an error of 0.1. In these cases, the sensitivity and specificity
provide a more correct assessment.
6. A maximum of 20 LVs is sufficient for most datasets.
7. In certain cases, outlying samples might give a small increase in
RMSE/classification error or a decrease in Q2 before the
RMSE/classification error continues to decrease or the Q2
continues to increase. If a sample appears to be an outlier in a
score plot, try to remove this sample and see if that changes the
results.
8. Typically n < 20 samples is too few to perform a cross valida-
tion other than LOOCV.
9. Data inputted in the inner loop is partitioned into a training set
to build models and a test set to assess the models built at each
iteration.
10. $n_{perm}$ is usually at least 1000 so that the obtained p-values are in
the thousandth order of magnitude ($10^{-3}$).
11. If a = 0, i.e., no permuted model performed better than the
unpermuted model, report the p-value as lower than $1/n_{perm}$.
For example, for $n_{perm} = 1000$, p < 1/1000, i.e., p < 0.001.
12. Statistical tests or graphical visualization can be used to evalu-
ate data distribution. Examples of statistical tests used to check
normality are Shapiro-Wilk or Kolmogorov-Smirnov
[36]. Graphical visualization can be performed by plotting
histograms (a symmetrical, bell-shaped histogram indicates a normal
distribution) or normal probability plots (q-q plots; points falling along the line y = x
indicate a normal distribution). For non-normally distributed
data, see Note 13.
13. Transformation of the data (e.g., log transformation) can be
applied to allow for parametric testing. Alternatively, nonpara-
metric tests can be chosen.
14. Due to the overwriting of q-values based on rank, it is typical
for one or more q-values to be repeated when adjusting using
this method.
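The unbalanced-data example in Note 5 can be checked numerically. The sketch below is only an illustration: the helper `confusion_counts` is hypothetical, and treating the minority class B as the "positive" class is an assumption made here for the purpose of computing sensitivity and specificity.

```python
def confusion_counts(true_labels, predicted_labels, positive):
    """Count true/false positives and negatives for one class treated as positive."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


# 90% of the samples are class A; the "model" predicts class A for every sample.
true_labels = ["A"] * 90 + ["B"] * 10
predicted_labels = ["A"] * 100

tp, fn, tn, fp = confusion_counts(true_labels, predicted_labels, positive="B")
error = (fp + fn) / len(true_labels)
sensitivity = tp / (tp + fn)   # fraction of class B samples correctly detected
specificity = tn / (tn + fp)   # fraction of class A samples correctly detected
print(error, sensitivity, specificity)
```

The error of 0.1 looks acceptable, while the sensitivity shows that no class B sample is ever detected.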
References
1. Clarke CJ, Haselden JN (2008) Metabolic profiling as a tool for understanding mechanisms of toxicity. Toxicol Pathol 36(1):140–147. https://doi.org/10.1177/0192623307310947
2. Bujak R, Struck-Lewicka W, Markuszewski MJ, Kaliszan R (2015) Metabolomics for laboratory diagnostics. J Pharm Biomed Anal 113:108–120. https://doi.org/10.1016/j.jpba.2014.12.017
3. Keeler J (2010) Understanding NMR spectroscopy, 2nd edn. Wiley, Chichester, UK
4. Hu K, Westler WM, Markley JL (2011) Simultaneous quantification and identification of individual chemicals in metabolite mixtures by two-dimensional extrapolated time-zero (1)H(13)C HSQC (HSQC(0)). J Am Chem Soc 133(6):1662–1665. https://doi.org/10.1021/ja1095304
5. Nicholson JK (1989) High resolution nuclear magnetic resonance spectroscopy in clinical chemistry and disease diagnosis. In: den Boer NC, van der Heiden C, Leijnse B, Souverijn JHM (eds) Clinical chemistry, an overview. Plenum Press, New York, NY
6. Le Gall G (2015) NMR spectroscopy of biofluids and extracts. In: Bjerrum TJ (ed) Metabonomics: methods and protocols. Springer, New York, NY, pp 29–36
7. Giskeødegård GF, Cao MD, Bathen TF (2015) High-resolution magic-angle-spinning NMR spectroscopy of intact tissue. In: Bjerrum TJ (ed) Metabonomics: methods and protocols. Springer, New York, NY, pp 37–50
8. Bathen TF, Geurts B, Sitter B, Fjøsne HE, Lundgren S, Buydens LM et al (2013) Feasibility of MR metabolomics for immediate analysis of resection margins during breast cancer surgery. PLoS One 8(4):e61578. https://doi.org/10.1371/journal.pone.0061578
9. Vettukattil R (2015) Preprocessing of raw metabonomic data. In: Bjerrum TJ (ed) Metabonomics: methods and protocols. Springer, New York, NY, pp 123–136
10. Euceda LR, Giskeødegård GF, Bathen TF (2015) Preprocessing of NMR metabolomics data. Scand J Clin Lab Invest 75(3):193–203. https://doi.org/10.3109/00365513.2014.1003593
11. Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemometr Intell Lab Syst 2(1):37–52. https://doi.org/10.1016/0169-7439(87)80084-9
12. The Mathworks Inc. Cophenetic correlation coefficient. http://www.mathworks.com/help/stats/cophenet.html#zmw57dd0e176726. Accessed 13 Apr 2016
13. Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Series B Stat Methodol 63(2):411–423. https://doi.org/10.1111/1467-9868.00293
14. Hawkins DM (2004) The problem of overfitting. J Chem Inf Comput Sci 44(1):1–12. https://doi.org/10.1021/ci0342472
15. Wold S, Sjöström M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemometr Intell Lab Syst 58(2):109–130. https://doi.org/10.1016/S0169-7439(01)00155-1
16. Andersson M (2009) A comparison of nine PLS1 algorithms. J Chemometr 23(10):518–529. https://doi.org/10.1002/cem.1248
17. van Velzen EJJ, Westerhuis JA, van Duynhoven JPM, van Dorsten FA, Hoefsloot HCJ, Jacobs DM et al (2008) Multilevel data analysis of a crossover designed human nutritional intervention study. J Proteome Res 7(10):4483–4491. https://doi.org/10.1021/pr800145j
18. Brougham DF, Ivanova G, Gottschalk M, Collins DM, Eustace AJ et al (2011) Artificial neural networks for classification in metabolomic studies of whole cells using 1H nuclear magnetic resonance. J Biomed Biotechnol 2011:8. https://doi.org/10.1155/2011/158094
19. Gromski PS, Muhamadali H, Ellis DI, Xu Y, Correa E, Turner ML et al (2015) A tutorial review: metabolomics and partial least squares-discriminant analysis – a marriage of convenience or a shotgun wedding. Anal Chim Acta 879:10–23. https://doi.org/10.1016/j.aca.2015.02.012
20. Wold S, Johansson E, Cocchi M (1993) PLS: partial least-squares projections to latent structures. In: Kubinyi H (ed) 3D QSAR in drug design. ESCOM, Leiden, The Netherlands, pp 523–550
21. Li H-D, Zeng M-M, Tan B-B, Liang Y-Z, Xu Q-S, Cao D-S (2010) Recipe for revealing informative metabolites based on model population analysis. Metabolomics 6(3):353–361. https://doi.org/10.1007/s11306-010-0213-z
22. Rajalahti T, Arneberg R, Kroksveen AC, Berle M, Myhr K-M, Kvalheim OM (2009) Discriminating variable test and selectivity ratio plot: quantitative tools for interpretation and variable (biomarker) selection in complex spectral or chromatographic profiles. Anal Chem 81(7):2581–2590. https://doi.org/10.1021/ac802514y
23. Mehmood T, Liland KH, Snipen L, Sæbø S (2012) A review of variable selection methods in partial least squares regression. Chemometr Intell Lab Syst 118:62–69. https://doi.org/10.1016/j.chemolab.2012.07.010
24. Westerhuis JA, Hoefsloot HCJ, Smit S, Vis DJ, Smilde AK, van Velzen EJJ et al (2008) Assessment of PLSDA cross validation. Metabolomics 4(1):81–89. https://doi.org/10.1007/s11306-007-0099-6
25. Riffenburgh RH (2006) Statistics in medicine, 2nd edn. Elsevier Academic Press, Burlington, MA
26. Bonferroni CE (1936) Teoria statistica delle classi e calcolo delle probabilità, Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, vol 8. Seeber, Firenze, pp 3–62
27. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57(1):289–300
28. Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29(4):1165–1188
29. Giskeødegård GF, Grinde MT, Sitter B, Axelson DE, Lundgren S, Fjøsne HE et al (2010) Multivariate modeling and prediction of breast cancer prognostic factors using MR metabolomics. J Proteome Res 9(2):972–979. https://doi.org/10.1021/pr9008783
30. Cao MD, Giskeødegård GF, Bathen TF, Sitter B, Bofin A, Lønning PE et al (2012) Prognostic value of metabolic response in breast cancer patients receiving neoadjuvant chemotherapy. BMC Cancer 12(1):1–11. https://doi.org/10.1186/1471-2407-12-39
31. Asiago VM, Alvarado LZ, Shanaiah N, Gowda GAN, Owusu-Sarfo K, Ballas RA et al (2010) Early detection of recurrent breast cancer using metabolite profiling. Cancer Res 70(21):8309–8318. https://doi.org/10.1158/0008-5472.can-10-1319
32. Tenori L, Oakman C, Morris PG, Gralka E, Turner N, Cappadona S et al (2015) Serum metabolomic profiles evaluated after surgery may identify patients with oestrogen receptor negative early breast cancer at increased risk of disease recurrence. Results from a retrospective study. Mol Oncol 9(1):128–139. https://doi.org/10.1016/j.molonc.2014.07.012
33. Keun HC, Sidhu J, Pchejetski D, Lewis JS, Marconell H, Patterson M et al (2009) Serum molecular signatures of weight change during early breast cancer chemotherapy. Clin Cancer Res 15(21):6716–6723. https://doi.org/10.1158/1078-0432.ccr-09-1452
34. Bro R, Kamstrup-Nielsen MH, Engelsen SB, Savorani F, Rasmussen MA, Hansen L et al (2015) Forecasting individual breast cancer risk using plasma metabolomics and biocontours. Metabolomics 11(5):1376–1380. https://doi.org/10.1007/s11306-015-0793-8
35. Bernstein IH, Garvin CP, Teng GK (1988) Applied multivariate analysis. Springer, New York, NY
36. Ghasemi A, Zahediasl S (2012) Normality tests for statistical analysis: a guide for non-statisticians. Int J Endocrinol Metab 10(2):486–489. https://doi.org/10.5812/ijem.3505
Part IV
Systems Biology of Metastasis and Tumor/Microenvironment Interactions
Chapter 10
Stochastic and Deterministic Models for the Metastatic Emission Process: Formalisms and Crosslinks
Christophe Gomez and Niklas Hartung
Abstract
Although the detection of metastases radically changes prognosis of and treatment decisions for a cancer
patient, clinically undetectable micrometastases hamper a consistent classification into localized or meta-
static disease. This chapter discusses mathematical modeling efforts that could help to estimate the
metastatic risk in such a situation. We focus on two approaches: (1) a stochastic framework describing
metastatic emission events at random times, formalized via Poisson processes, and (2) a deterministic
framework describing the micrometastatic state through a size-structured density function in a partial
differential equation model. Three aspects are addressed in this chapter. First, a motivation for the Poisson
process framework is presented and modeling hypotheses and mechanisms are introduced. Second, we
extend the Poisson model to account for secondary metastatic emission. Third, we highlight an inherent
crosslink between the stochastic and deterministic frameworks and discuss its implications. For increased
accessibility the chapter is split into an informal presentation of the results using a minimum of mathemati-
cal formalism and a rigorous mathematical treatment for more theoretically interested readers.
Key words Poisson process, Structured population equation, Metastasis, Mathematical modeling
1 Introduction
Metastasis is the spread of cancer cells to distant tissues and can be broadly
divided into two steps, physical dissemination and tissue-specific
colonization [1]. While the first part is facilitated by a reversible
phenotypic change of cancer cells [2], successful colonization
involves complex tumor–microenvironment interactions and is
still not well understood [3, 4].
Being responsible for most cancer-related deaths, metastasis is a
pivotal point in disease history [5]. However, since metastases
smaller than approximately $10^7$ cells remain undetectable by medical
imaging or other diagnostic tools, the clinical appearance of
nonmetastatic disease may not reflect the true metastatic state of a
patient. Therefore, estimating the metastatic risk in cancer patients
without visible metastases is of major clinical importance [6]. In
this respect, mathematical and statistical techniques have the potential
to derive risk scores from clinical data.
Both authors equally contributed to this work.
Today, there is a large body of mathematical oncology litera-
ture; a recent review specifically focuses on the metastatic process
[7]. Here, we will briefly summarize modeling efforts focusing on
analyzing metastatic risk.
The emergence of a metastatic phenotype is governed by a
number of key mutations [8, 9]. In [10] and [11] this assumption
is translated into mathematical models, thereby deriving a meta-
static risk score from evolutionary principles. In an opinion paper,
approaches explaining emergent behavior through lower-level
mechanisms were qualified as “the essence of integrated mathemat-
ical oncology” [12]. While acknowledging the importance of such
approaches for improving our understanding of cancer biology, we
will focus here on more data-driven models featuring simpler prin-
ciples and thereby better matching the limited amount of informa-
tion available in a clinical situation.
In an early work, a link between primary tumor size at surgery
and risk of recurrence was established from a large cohort of breast
cancer patients [13]. Later, these and other data were explained
heuristically [14] (see also Subheading 2.1). In addition to such
phenotypic characteristics, specific genetic signatures of the primary
tumor have been found to be associated with increased metastatic
risk [15]. These risk prediction models are static; they do not aim at
representing the time evolution of the disease.
In contrast, dynamic models allow the state of the modeled
system to be predicted at different times, and more easily integrate data obtained
at different observation times. To represent the dependency of the
metastatic process on the primary tumor, a dynamic description of
primary tumor growth is also integrated into metastatic models.
The simplest dynamic model for tumor growth is the exponential
model, which adequately describes growth under no restrictions
(e.g., in vitro). In many cases of interest however, especially in vivo,
sigmoidal (s-shaped) models with an initial exponential phase and
subsequent deceleration are better suited to describe growth
dynamics. The Gompertz model is a classical example that has
been commonly used [16–18]. Power growth models have also
been used for the description of clinical [19] and preclinical
tumor growth data [20]. Although much more complicated models
have been developed, e.g., describing the spatial evolution of a
tumor, here we restrict our discussion of primary tumor growth to
the simple models introduced above.
A stochastic dynamic model for metastasis was proposed in
[21]. Their approach described the emission times of metastases
as random events, formalized through a so-called non-homoge-
neous Poisson process with an emission rate increasing with
primary tumor size (see Subheading 2.2). A variant of this model
successfully described data on bone lesions from a metastatic breast
cancer case [22]. The Poisson distribution allows for an interpreta-
tion of the emission process as being “memoryless” and many of its
properties can be analyzed mathematically.
Dynamic models for metastasis can also be used to infer parts of
the process that cannot be observed experimentally or clinically. In
this respect, a partial differential equation (PDE) model was pro-
posed to describe the size distribution of metastatic colonies [23]
(see Subheading 3.1). Thereby, they characterized the micrometa-
static state of a hepatocellular carcinoma patient from clinical infor-
mation on visible metastatic colonies. Later, this size-structured
model was also successfully incorporated into mixed-effects models
to predict the size evolution of metastases in animal models without
[24] and with [25] surgical removal of the primary tumor. A unique
feature of this approach is that it allows secondary
metastatic emission to be integrated into the model. Indirect evidence on the capacity
of metastases to spread further was given both through cancer
network models [26–28] and through the discovery of self-seeding
mechanisms [29, 30].
In this chapter, we focus on the two dynamical frameworks for
metastasis described above, the Poisson process and the size-
structured model. Three aspects are covered:
• an accessible introduction to the Poisson process framework
with an example motivating the approach,
• an extension of the Poisson model to account for secondary
metastatic emission,
• the inherent link between the (extended) Poisson model and the
size-structured model.
Finally, we exploit the crosslink between the two frameworks in
order to evaluate the adequacy of the modeling assumptions in the
deterministic model and to realize simulations using both frame-
works together.
We restrict our discussion to models describing the natural
history of metastatic progression. While surgery of the primary
tumor can be represented in these models, to incorporate the effect
of systemic treatments a more general formalism would be
required. For the ease of presentation, we will not include this
layer of complexity here, although we point out that both frame-
works have been extended to cover much more general cases,
including systemic treatment [31, 32] and more complex interac-
tions between primary tumor and metastases [33].
For increased accessibility for readers with a non-mathematical
background, the present work is split into two parts. First, the
concepts behind these approaches are presented informally in Sub-
headings 2 and 3 breaking down the mathematical formalism as
much as possible. Subheading 5 contains rigorous definitions of all
mathematical objects and the precise statement of the mathematical
results; detailed proofs of these results are provided in the Appendix.
2 A Probabilistic Framework for Metastatic Emission
2.1 Metastatic Risk
Predicting the probability of metastatic disease at diagnosis of the
primary tumor is of major clinical importance since it is strongly
linked to survival expectancy. One possibility to build such a pre-
diction model is by using large databases to correlate information
on the presence of metastases to primary tumor characteristics at
diagnosis or surgery. As an example, we show a relationship estab-
lished in [14] between primary tumor size at surgery and probabil-
ity of metastasis based on clinical data on breast cancer:
$$\mathbb{P}(\text{no metastases}) = \exp(-c\, d^{z}), \qquad (1)$$
where d is the largest diameter of the primary tumor at surgery, and
c, z > 0 are parameters which were determined from the cohort
data. At first sight, these parameters are merely empirical and cannot
be associated with any mechanism. However, a mechanistic interpretation
as the combined effect of growth and emission dynamics is
possible, and is based on the following premises:
Power growth. The growth of the largest primary tumor diameter d follows a power law, i.e. it is the solution of the ordinary differential equation $d'(t) = a_{PG}\, d(t)^{\alpha}$. In this equation, $a_{PG}$ determines the growth speed and $\alpha$ allows different growth shapes to be described: exponential growth ($\alpha = 1$), linear growth ($\alpha = 0$), and a spectrum of sigmoidal growth patterns in between ($0 < \alpha < 1$).
Power law of emission. During each (infinitesimal) time interval, there is a chance that the primary tumor emits a metastasis. The emission intensity $\lambda$ depends on the current size of the primary tumor through a power law: $\lambda(t) = b\, d(t)^{\beta}$. In this context, the parameter b can be interpreted as the metastatic aggressiveness of the emitting tumor. In [23], $\beta$ was linked to the mode of vascularization of the primary tumor: a uniform vascularization would correspond to $\beta = 3$ (dimension of space) and a surficial vascularization to $\beta = 2$ (dimension of a surface) (see Note 1).
Memorylessness. The probability of emission of a metastasis is independent of the previous emission history.
A typical model allowing a mathematical formalization of the
above premises is the Poisson process. In principle, randomness of
the metastatic emission process could also be represented through
different probability laws, thereby dropping the memorylessness
property as done, e.g., by Bethge et al. [34]. However, the Poisson
model has several advantages. It does not require any additional
statistical parameters, it has a high degree of analytical tractability
(i.e., many of its properties can be investigated through mathemat-
ical analysis and not only by simulations), and there are efficient
numerical routines such as thinning to simulate the process (see,
e.g., [35]). The detailed derivation of the empirical relationship is
presented in Subheading 5.2.
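As an informal numerical check of how the three premises lead to the form of Eq. 1, the sketch below integrates the emission intensity along a power-growth curve and compares the result with $\exp(-c\,d^{z})$. It is not the derivation of Subheading 5.2. It assumes a negligible initial diameter $d_0$, and it uses the identification $z = \beta - \alpha + 1$, $c = b/(a_{PG}\, z)$, which follows from the change of variables $dt = dd/(a_{PG} d^{\alpha})$ in the integral of $\lambda$. All parameter values are hypothetical.

```python
import math


def diameter(t, a_pg, alpha, d0):
    """Closed-form solution of d'(t) = a_pg * d(t)**alpha, valid for alpha < 1."""
    return (d0 ** (1 - alpha) + a_pg * (1 - alpha) * t) ** (1 / (1 - alpha))


def cumulative_intensity(t_end, a_pg, alpha, b, beta, d0, steps=20000):
    """Trapezoidal approximation of the integral of b * d(s)**beta over [0, t_end]."""
    h = t_end / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * b * diameter(i * h, a_pg, alpha, d0) ** beta
    return total * h


# Hypothetical parameter values, chosen for illustration only.
a_pg, alpha, b, beta, d0 = 0.01, 0.5, 5e-4, 2.0, 0.01
t_surgery = 400.0
d_surgery = diameter(t_surgery, a_pg, alpha, d0)

# Memorylessness (Poisson process): P(no metastases) = exp(-Lambda(t_surgery)).
p_numeric = math.exp(-cumulative_intensity(t_surgery, a_pg, alpha, b, beta, d0))

# Empirical form of Eq. 1 with z = beta - alpha + 1 and c = b / (a_pg * z).
z = beta - alpha + 1
c = b / (a_pg * z)
p_formula = math.exp(-c * d_surgery ** z)
print(p_numeric, p_formula)
```

The two probabilities agree up to a term of order $c\,d_0^{z}$, which is why the initial diameter has to be small for Eq. 1 to hold in this form.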
2.2 Poisson Processes
A Poisson process (PP) is a model for counting a series of events
occurring at random times. The precise definition of this process is
given in Subheading 5, but its basic properties are the two follow-
ing ones (see Fig. 1 for an illustration):
1. The numbers of events in disjoint time intervals are independent.
This expresses the memorylessness property: given some
time t, the number of future events (those happening at any time
$t_{future} > t$) does not depend on the past events (those happening
at any time $t_{past} < t$), but only on the present state of the
system at time t. For example, in Fig. 1 the time elapsed between
t and $T^{(4)}$ is independent of when exactly $T^{(3)}$ occurred. In other
words, the system forgot what happened up to time t.
2. The number of events $N_t$ that occurred by time t has a Poisson
distribution with parameter
$$\Lambda(t) = \int_0^t \lambda(s)\,ds,$$
i.e., the integral of the emission intensity $\lambda(s)$ over s in the time
interval [0, t]. This means that the probability of having observed
exactly k events until time t is given by
$$\mathbb{P}(N_t = k) = \frac{\Lambda(t)^k}{k!}\, e^{-\Lambda(t)}.$$
Fig. 1 Schematic trajectory of a Poisson process (axes: time versus number of events). Here, $T^{(1)}, \ldots, T^{(4)}$ are the
times at which events occur, and by time t we have 3 events, i.e. $N_t = 3$
These two properties characterize the PP and, together with $N_0 = 0$,
can even serve as a definition. From these two properties, one
can show that the probability that the next event time lies between
times t and $t + \Delta t$ is approximately $\lambda(t)\Delta t$ for small $\Delta t$. Hence, $\lambda$ determines the
event frequency, and this is the reason why it is called the intensity
function.
In the setting of this chapter, we are interested in describing the
inception times of new metastatic lesions via PPs. This means that
Nt is the number of metastases emitted until time t in our context.
Following [21, 22], we will first suppose that only the primary
tumor has the capacity of seeding metastases.
A constant emission intensity λ (called a homogeneous PP)
would mean that a tumor consisting of a few cells is equally likely
to shed a metastasis as a large tumor of several grams. Since such a
model is not realistic, we need to consider time-varying intensities λ
(called non-homogeneous PPs). We will consider an emission
intensity λ that depends on some measure of primary tumor size
$X_p(t)$ (diameter, volume, number of cells, etc.). The relationship
between primary tumor size and emission intensity is given by a
size-dependent emission law $\gamma$, i.e. $\lambda(t) = \gamma(X_p(t))$.
Before going into more detail, let us introduce a set of clinical
parameters (summarized in Table 1), which will be used throughout
this chapter to further illustrate those concepts. These parameters
were estimated in [23] from clinical data on a hepatocellular
carcinoma with multiple liver metastases. Although derived
within the deterministic framework of the size-structured model
(see Subheading 3.1 for more details), the inherent link with the PP
framework described in Subheading 3.2 ensures that these parameters
are also relevant in the PP model; we will therefore use the
same set of parameters in both frameworks. Also, we will make use
of a slight modification of this clinical setting to predict the risk of
distant metastasis after surgery. To represent the impact of a surgery
at time $t_{surgery}$, the emission intensity will be set to zero for all times
larger than $t_{surgery}$. Randomness of emission means that each emission
time can be represented via its probability density function; this
is illustrated for the emission time of the first metastasis $T^{(1)}$ in
Fig. 2.
Table 1
Growth and emission laws derived in [23] from clinical data of a hepatocellular carcinoma case with
multiple liver metastases
Model                                            Parameter       Symbol          Value                  Unit
Growth (Gompertz law)                            Initial size    $x_0$           1                      Cells
$g(x) = a_{Gomp}\, x \log(x_p^{\infty}/x)$       Growth rate     $a_{Gomp}$      0.00286                Days$^{-1}$
                                                 Maximum size    $x_p^{\infty}$  $7.3 \times 10^{10}$   Cells
Emission (power law)                             Rate constant   $b$             $5.3 \times 10^{-8}$   Days$^{-1}$ cells$^{-1}$
$\gamma(x) = b\, x^{\beta}$                      Emission power  $\beta$         0.663                  –
Primary tumor size is given by $X_p' = g(X_p)$, $X_p(0) = x_0$, and primary tumor emission rate is given by $\lambda(t) = \gamma(X_p(t))$. This
set of parameters is used throughout the chapter; when used in the PP framework the emission rate $\lambda$ is taken as the
emission intensity
Fig. 2 Probability density function (pdf) for the emission time of the first
metastasis $T^{(1)}$, using the clinical parameters in Table 1 (axes: time in months versus probability density). The analytical
formula of the pdf is $f_{T^{(1)}}(t) = \lambda(t)\, e^{-\Lambda(t)}$
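The quantities in this paragraph can be evaluated numerically with the parameters of Table 1. The following Python sketch is an illustration under stated assumptions, not the authors' code: the function names, the trapezoidal integration, the 30-day month, and the optional `t_surgery` argument are choices made here. It uses the closed-form solution of the Gompertz equation, $X_p(t) = x_p^{\infty}\,(x_0/x_p^{\infty})^{\exp(-a_{Gomp} t)}$.

```python
import math

# Clinical parameters from Table 1 (hepatocellular carcinoma case, [23]).
X0 = 1.0          # initial size (cells)
A_GOMP = 0.00286  # Gompertz growth rate (1/day)
X_INF = 7.3e10    # maximum size (cells)
B = 5.3e-8        # emission rate constant
BETA = 0.663      # emission power


def primary_size(t):
    """Closed-form Gompertz solution of X_p' = a * X_p * log(x_inf / X_p), X_p(0) = x0."""
    return X_INF * (X0 / X_INF) ** math.exp(-A_GOMP * t)


def emission_intensity(t, t_surgery=None):
    """lambda(t) = b * X_p(t)**beta, set to zero after an optional surgery time."""
    if t_surgery is not None and t > t_surgery:
        return 0.0
    return B * primary_size(t) ** BETA


def cumulative_intensity(t, t_surgery=None, steps=2000):
    """Trapezoidal approximation of Lambda(t), the integral of lambda over [0, t]."""
    h = t / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * emission_intensity(i * h, t_surgery)
    return total * h


def first_emission_pdf(t):
    """Density of the first emission time: lambda(t) * exp(-Lambda(t))."""
    return emission_intensity(t) * math.exp(-cumulative_intensity(t))


def prob_metastasis(t, t_surgery=None):
    """P(N_t > 0) = 1 - exp(-Lambda(t))."""
    return 1.0 - math.exp(-cumulative_intensity(t, t_surgery))


for months in (6, 12, 18, 24):
    t = 30.0 * months  # assumption: 30-day months; densities are per day
    print(months, first_emission_pdf(t), prob_metastasis(t))
```

Passing `t_surgery` freezes the cumulative intensity at the time of surgery, so the predicted risk of metastatic disease stops increasing afterwards (up to the discretization error of the integration grid).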
The number of metastases $N_t$ is itself random in this model, but
relevant deterministic quantities can be derived from $N_t$, such as the
expected number of metastases $\mathbb{E}[N_t]$ or the probability of metastatic
disease $\mathbb{P}(N_t > 0)$. Exploiting the memorylessness property
of PPs, these quantities can be computed without any need to
simulate the process (all the following formulas are proven in
the Appendix):
$$\mathbb{E}[N_t] = \int_0^t \lambda(s)\,ds \qquad (2)$$
and
$$\mathbb{P}(N_t > 0) = 1 - \exp\left(-\int_0^t \lambda(s)\,ds\right).$$
Also, a formula for the variance of $N_t$ is obtained readily:
[Figure 3: number of metastases (vertical axis) versus time in years (horizontal axis); legend: Expectation, Variability, Stochastic trajectory]
Fig. 3 Illustration of the non-homogeneous Poisson process $N_t$, representing the
number of metastases emitted by the primary tumor (clinical parameters,
Table 1). To simulate the process, a set of random times is simulated, which
then yields a random trajectory. Repeated simulations would lead to different
trajectories, and for a large number of random trajectories the “average
trajectory” is approximately given by the expectation of the process $\mathbb{E}[N_t]$,
which can be directly computed via Eq. 2. Variability around $\mathbb{E}[N_t]$ can be
computed as $\mathbb{E}[N_t] \pm 2\sqrt{\operatorname{var}[N_t]}$, where $\operatorname{var}[N_t]$ is the variance of $N_t$,
computed via Eq. 3
\[ \operatorname{var}[N_t] = \int_0^t \lambda(s)\,ds. \tag{3} \]
The concepts $N_t$, $\mathbb{E}[N_t]$ and $\operatorname{var}[N_t]$ are illustrated in Fig. 3.
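As a minimal numerical sketch of Eqs. 2 and 3 (not the implementation used for the figures), the cumulative intensity can be approximated with a trapezoidal rule. The parameter values are those of Table 1; the closed-form Gompertz solution (with a natural logarithm in the growth law), the function names, and the number of integration steps are assumptions made for illustration only.

```python
import math

# Clinical parameters (Table 1)
A_GOMP = 0.00286   # growth rate, 1/day
X0 = 1.0           # initial size, cells
XP1 = 7.3e10       # maximum size, cells
B = 5.3e-8         # emission rate constant, 1/(day*cell)
BETA = 0.663       # emission power


def primary_size(t):
    """Closed-form solution of X' = a*X*log(xp1/X), X(0) = x0 (natural log assumed)."""
    return XP1 * (X0 / XP1) ** math.exp(-A_GOMP * t)


def emission_intensity(t):
    """lambda(t) = gamma(X_p(t)) = b * X_p(t)**beta."""
    return B * primary_size(t) ** BETA


def cumulative_intensity(t_end, n_steps=5000):
    """Trapezoidal approximation of the integral of lambda over [0, t_end]."""
    h = t_end / n_steps
    total = 0.5 * (emission_intensity(0.0) + emission_intensity(t_end))
    for i in range(1, n_steps):
        total += emission_intensity(i * h)
    return h * total


def summary(t_end):
    """E[N_t], the band E[N_t] +/- 2*sqrt(var[N_t]), and P(N_t > 0)."""
    big_lambda = cumulative_intensity(t_end)   # equals both E[N_t] and var[N_t]
    spread = 2.0 * math.sqrt(big_lambda)
    return {
        "expected": big_lambda,
        "band": (big_lambda - spread, big_lambda + spread),
        "p_metastatic": 1.0 - math.exp(-big_lambda),
    }


if __name__ == "__main__":
    for days in (365.0, 500.0, 730.0):
        print(days, summary(days))
```

Because expectation and variance coincide for a Poisson process, a single integral is enough to obtain all three quantities.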
If a metastatic growth law is added to the model, the total
metastatic mass (or total cell count, sum of lesion volumes) Mt—
again a random quantity—can be represented via the emission times
of the PP. Mt can be compared to quantitative measures of total
metastatic biomass, obtainable, e.g., via bioluminescence imaging
[24]. We will assume that all metastases follow the same deterministic
growth law $X_m$, which may, however, differ from the primary
tumor growth law. Therefore, the size difference among metastases
is entirely explained by differences in metastatic inception times,
and $M_t$ can be written as
\[ M_t = \sum_{k=1}^{N_t} X_m\big(t - T^{(k)}\big), \]
where $X_m(0) = x_m^0$ is the initial size of a metastasis. Expectation
and variance of the metastatic burden can also be calculated
analytically:
\[ \mathbb{E}[M_t] = \int_0^t \lambda(s)\,X_m(t - s)\,ds, \tag{4} \]
\[ \operatorname{var}[M_t] = \int_0^t \lambda(s)\,\big(X_m(t - s)\big)^2\,ds. \tag{5} \]
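Eqs. 4 and 5 can be evaluated in the same way as Eq. 2. The sketch below is only an illustration: it assumes that metastases follow the same Gompertz law as the primary tumor and start from a single cell (the text allows a different law), and the function names are hypothetical.

```python
import math

A_GOMP, XP1 = 0.00286, 7.3e10   # Gompertz growth rate (1/day) and maximum size (cells), Table 1
B, BETA = 5.3e-8, 0.663         # emission rate constant and emission power, Table 1


def gompertz_size(age, x_init=1.0):
    """Size of a tumor of the given age (days); assumed identical for primary tumor and metastases."""
    return XP1 * (x_init / XP1) ** math.exp(-A_GOMP * age)


def emission_intensity(t):
    """Emission intensity of the primary tumor, lambda(t) = b * X_p(t)**beta."""
    return B * gompertz_size(t) ** BETA


def burden_moments(t_end, n_steps=4000):
    """Return (E[M_t], var[M_t]) from Eqs. 4 and 5 using the trapezoidal rule."""
    h = t_end / n_steps
    mean = 0.0
    var = 0.0
    for i in range(n_steps + 1):
        s = i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        lam = emission_intensity(s)
        size = gompertz_size(t_end - s)   # X_m(t - s): metastasis emitted at time s
        mean += weight * lam * size
        var += weight * lam * size ** 2
    return h * mean, h * var


if __name__ == "__main__":
    mean, var = burden_moments(365.0)
    print("E[M_t] =", mean, "sd[M_t] =", math.sqrt(var))
```

At early times the standard deviation exceeds the mean, which reflects that the burden is dominated by whether or not an early metastasis has been emitted at all.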
The assumption of a common growth law for all metastases greatly
simplifies the model, which is both an advantage (for identifiability
from clinical data) and a drawback (for a correct representation of
cancer biology). Although this is beyond the scope of this chapter, the
assumption could be replaced by a less restrictive one, e.g. by supposing that
individual growth parameters are drawn randomly from a given
probability distribution. However, even if easily integrated into
numerical algorithms, such a feature would be prohibitive for any
characterization of the model through mathematical analysis.
2.3 Secondary Emission
In the model described above, metastases do not have the capacity
to emit metastases themselves. However, it is easy to think of a case
in which such a property would make a difference in the model. For
example, suppose that only a single metastasis is emitted prior to
surgery of the primary tumor (see Note 2). If this metastasis cannot
emit further metastases, its successful removal cures the patient; if it
is able to seed as well, however, this second surgery may fail.
Of course, there are other mechanisms potentially leading to treat-
ment failure (e.g., local recurrence, surgery impossible, etc., see,
e.g., [36, 37]), but for simplicity these are not considered here.
In this section, we extend the previously shown PP model to
account for secondary metastatic emission using PPs as building
blocks. Many of the advantages and limitations of the PP model
carry over to the extended model, and we do not claim that a
comprehensive framework for cancer metastasis is built in that
way. The model is, however, simple enough to have a chance to
be parametrized reasonably from clinical data.
Conceptually, the extension is straightforward: as before, the
primary tumor grows according to Xp and metastatic emission by
the primary tumor is represented by a PP with intensity λp. In
addition, any emitted metastasis has the same capacities as the
primary tumor, but possibly with different growth and emission
rates (Xm instead of Xp, and λm instead of λp). If we consider a
metastasis emitted at time $s$, this means that at a later time $t$ it
reaches the size $X_m(t - s)$ and emits metastases with intensity
$\lambda_m(t - s)$. Every newly emitted metastasis starts a new PP. Also,
each metastasis has a precursor (either the primary tumor or
another metastasis). The whole model then consists of the meta-
static emission times from all of these PPs. Since each PP can start
other PPs, we call this model a PP cascade. We then need to make a

hypothesis on how the different emission processes play together.
Independence of Emissions: We assume that each metastasis emits
independently of any other metastasis, and also independently of
the primary tumor. In other words, all the PPs involved in the
dynamics are independent.
Such an assumption is very important to be able to characterize
the properties of the PP cascade with mathematical techniques. Note
that simply ordering all emission events including secondary emissions
by increasing emission time would not allow us to use the
independence assumption made above, since the inception time of each metastasis
depends on its level in the generational hierarchy (primary tumor,
metastases emitted from the primary tumor, metastases emitted from
the metastases emitted from the primary tumor, etc.). Therefore, in
our model we have to account for the filiation of each metastasis. For
example, the emission time of the first metastasis emitted by the
primary tumor is denoted by T(1), and the emission time of the first
metastasis emitted by the first metastasis emitted by the primary tumor
is denoted by $T^{(1,1)}$, which depends on $T^{(1)}$. More precisely,
$T^{(1,1)} = T^{(1)} + \tilde{T}^{(1,1)}$, where $\tilde{T}^{(1,1)}$ is the first emission time for the
PP generated at time $T^{(1)}$. Filiation in the cascaded model is further
illustrated in Fig. 4, and a rigorous definition is provided in
Subheading 5.4.
[Figure 4: timeline diagram with four horizontal arrows, from top to bottom:
$PP(\lambda_p; 0)$ with emission times $T^{(1)}, T^{(2)}, T^{(3)}, T^{(4)}$;
$PP(\lambda_m; T^{(1)})$ with emission times $T^{(1,1)}, T^{(1,2)}, T^{(1,3)}$;
$PP(\lambda_m; T^{(2)})$ with emission times $T^{(2,1)}, T^{(2,2)}, T^{(2,3)}$;
$PP(\lambda_m; T^{(1,1)})$ with emission times $T^{(1,1,1)}, T^{(1,1,2)}$]
Fig. 4 Illustration of the first three generations for a Poisson process (PP) cascade. Each long horizontal arrow
represents a PP (from top to bottom: primary tumor, first metastasis of first generation, second metastasis of
first generation, first metastasis of second generation emitted by first metastasis of first generation). Each
short vertical arrow represents an emission by the PP it points towards. This starts a new PP, connected by a
dashed line. In the notation $PP(\lambda; T)$, $\lambda$ is the intensity of the PP and $T$ is the starting time of the new PP
(emission times are counted from the start of the respective PP and not from zero)
3 Crosslink to a Structured Population Model
3.1 Size-Structured Model
Let us consider a different framework for the description of metastasis,
which also represents the metastatic process purely as growth
and emission dynamics. To describe the micrometastatic state of
cancer patients a size-structured model was developed [23]. The
model describes the time evolution of a density function $\rho(x, t)$
representing the size distribution of metastatic colonies: the integral
$\int_{x_1}^{x_2} \rho(x, t)\,dx$ represents the number of metastases at time $t$ with
size between $x_1$ and $x_2$. Therefore, $\rho$ is like a smoothed histogram of
the number of metastases within different size ranges.
To better understand why a size density is considered, it is
instructive to draw an analogy to Lagrangian and Eulerian descrip-
tion of a fluid flow (see, e.g., [38] for a comprehensive discussion).
In a Lagrangian description, the observer follows individual parti-
cles through the flow field. In contrast, for the Eulerian point of
view the observer considers the flow density through fixed refer-
ence points. These two frames of reference are illustrated in Fig. 5.
In this picture, metastatic growth becomes “flow through size
space.” In the PP model this is represented in a Lagrangian fashion:
a growth function is associated to each individual metastasis. In the
size-structured model an Eulerian frame of reference is used: the
entire population of metastatic tumors is described through a den-
sity function moving through size space at a “speed” g(x) (i.e., the
growth rate), in other words a size-structured density.
Formalizing metastatic growth from an Eulerian perspective
leads to a PDE model. Metastatic emission is the boundary
[Figure 5: two panels, each with axes Time and Size; the growth curve $X_m(t - s)$ is indicated]
Fig. 5 Representation of the Lagrangian (left) and Eulerian (right) frames of reference for describing a
population of growing metastases. Left: the observer (the eye symbol) follows the growth curves of individual
metastases; time and size coordinates determine the observer’s position. Right: a static observer looks from
the outside at the growth speed $g$ in fixed time-size areas. The relationship $X_m'(t) = g(X_m(t))$ holds
condition of the PDE, which means that it describes the “arrival of

new particles into size space” (see Eq. 9). In contrast to the PP
cascade model, where metastatic emission was a stochastic process,
in the size-structured model (the PDE model) emission is deter-
ministic. The emission dynamics consists of a primary tumor con-
tribution and a contribution of the metastases themselves, both
depending on the size of the emitting tumor. The size-dependency
of the metastatic emission rate entwines metastatic growth with
metastatic emission dynamics, which requires special attention dur-
ing mathematical analysis of the model [39] as well as for designing
an efficient numerical resolution scheme [40].
To illustrate the model dynamics, the clinical parameters of
Table 1 were used to simulate the metastatic density function at
different times (see Fig. 6). The model equations, together with
relevant properties of the size-structured model, are presented in
Subheading 5.3.
[Figure 6: time evolution of the metastatic density; metastatic density $\rho(t, x)$ (vertical axis, logarithmic, $10^{-12}$ to $10^{0}$) versus size $x$ in number of cells (horizontal axis, logarithmic, $10^{0}$ to $10^{10}$); the curves move to the right with increasing $t$.
Legend: $t = 1$ year: $N_{\mathrm{micro}} = 0.1$, $N_{\mathrm{macro}} = 0$, $M = 320$;
$t = 2$ years: $N_{\mathrm{micro}} = 14$, $N_{\mathrm{macro}} = 0.1$, $M = 8.5 \cdot 10^{6}$;
$t = 3$ years: $N_{\mathrm{micro}} = 118$, $N_{\mathrm{macro}} = 15$, $M = 3.8 \cdot 10^{9}$]
Fig. 6 Time evolution of the metastatic density in the size-structured model for metastasis. Each solid line
represents a snapshot of the metastatic density $\rho$ at a particular point in time (1/2/3 years after inception of
the primary tumor). Due to the growth dynamics, the density is transported to the right. Several quantities
computable from the density are represented in the legend: $N_{\mathrm{micro}}$, number of metastases smaller than $10^{8}$
cells; $N_{\mathrm{macro}}$, number of metastases larger than $10^{8}$ cells; $M$, total metastatic mass (number of cells of all
metastases together)
3.2 Bridging the Gap: Model Observables
We now describe how the size-structured model and the PP cascade
model are related. At first view, the two frameworks describe quite
different objects. While the PP cascade is concerned with a collec-
tion of emission times with a generational hierarchy, the size-
structured model features a density function. Nevertheless, as we
will see, the latter can be seen as the expectation of the PP cascade
model. To describe precisely the relationship between the models,
we need to introduce model observables as a common theme. In fact,
we have already introduced some model observables without nam-
ing them so. The model observables include the number of metas-
tases, the number of micro-/macro-metastases, and the total
metastatic mass.
Let us start with the size-structured model. For each function $f$,
a model observable (MO) is defined by
\[ \mathrm{MO}_f(t) := \int_{x_m^0}^{x_m^1} f(x)\,\rho(x, t)\,dx, \tag{6} \]
where $x_m^0$ is the size of a newly emitted metastasis and $x_m^1$ denotes
the theoretical upper boundary, i.e. it is integrated over all possible
sizes of metastases. Different choices for $f$ are possible, and each of
them corresponds to one observable (this dependency is made
explicit through the subscript $f$ in $\mathrm{MO}_f$). The definition includes
the above-mentioned quantities:
• The number of metastases is obtained for $f = 1$, i.e. the function
that equals 1 for all $x$: $\mathrm{MO}_1(t) = \int_{x_m^0}^{x_m^1} 1 \cdot \rho(x, t)\,dx = N(t)$.
• Similarly, the number of macrometastases is obtained with
\[ f_{\mathrm{macro}}(x) = \begin{cases} 1 & \text{if } x \ge c \\ 0 & \text{if } x < c, \end{cases} \]
and the number of micrometastases with
\[ f_{\mathrm{micro}}(x) = \begin{cases} 0 & \text{if } x \ge c \\ 1 & \text{if } x < c. \end{cases} \]
Here $c$ stands for the detectability threshold, which depends on
the imaging modality.
• The total metastatic mass $M$ is obtained with the identity function
$f_{\mathrm{Id}}(x) = x$ for all $x$:
\[ \mathrm{MO}_{f_{\mathrm{Id}}}(t) = \int_{x_m^0}^{x_m^1} x\,\rho(x, t)\,dx = M(t). \]
Apart from allowing us to consider all these model-derived quantities
at once, it is important for the mathematical proofs in Subheading 5
to consider such a general notion of observable.
Writing down the model observables in the PP cascade model is
slightly more complicated and it will be easier to illustrate it with an
observable in the PP model without secondary emission. There, a
stochastic model observable (SMO) is defined by
\[ \mathrm{SMO}_f(t) := \sum_{k=1}^{N_t} f\big(X_m(t - T^{(k)})\big). \tag{7} \]
The observables are defined in such a way that their interpretation is
the same in both frameworks. For example, $f = 1$ yields the number
of metastases $N_t$
\[ \mathrm{SMO}_1(t) = \sum_{k=1}^{N_t} 1 = N_t, \]
and $f_{\mathrm{Id}}$ yields the metastatic mass $M_t$
\[ \mathrm{SMO}_{f_{\mathrm{Id}}}(t) = \sum_{k=1}^{N_t} X_m\big(t - T^{(k)}\big) = M_t. \tag{8} \]
If we ordered all emission events including secondary emissions by

increasing emission time (and still called these times T(1), T(2),
etc.), this could also be used as the definition of a stochastic
model observable in the PP cascade. However, in order to carry
out the calculations required for bridging the gap between the two
frameworks, we need to account for the filiation of a metastasis,
i.e. its level in the generational hierarchy. An explicit definition of
the SMO using filiation is provided by Eq. 13 in Subheading 5.4.
Similarly to Eq. 2, where the expected number of metastases
$\mathbb{E}[N_t]$ was computed in the PP model without secondary emission,
an expression for the expectation and variance of each SMO can be
derived in the PP cascade model. These computations are more
complicated and are presented in detail in the Appendix. It is then
shown that the expected value of each SMO is equal to the
corresponding MO in the size-structured model; in this sense, the
size-structured model describes the mean behavior of the PP cascade
model:
\[ \mathrm{MO}_f(t) = \mathbb{E}[\mathrm{SMO}_f(t)]. \]
A rigorous mathematical statement of these results is given in
Subheading 5.4.
The relationship between model observables in the two frame-
works is a consequence of a relationship between more fundamental
mathematical objects (a random measure in the PP cascade and an
absolutely continuous measure in the size-structured model). For

the sake of simplicity, we do not present this additional layer here
and refer to Subheading 5.4 for more details.
3.3 Implications
In physics, a density is usually derived on the hypothesis of a large
number of constituting particles. In the derivation of the size-
structured model, these principles were applied to metastasis
[23]. However, this notion of density is questionable during the
early phase of metastasis, where the number of metastases is low:
what is one single metastasis spread over the whole size range? The
alternative interpretation as the expected value of a cascade of PPs
provides a more flexible framework. For any model observable
(e.g., the number of metastases), the adequacy of the size-
structured model can be evaluated by quantifying the variance of
the corresponding PP cascade.
Let us illustrate this approach by an example. When parame-
trizing the size-structured model from clinical data on the size
distribution of metastatic colonies, the model authors did not
represent randomness inherent in the emission process [23]. To
account for this neglected source of variability, we use the crosslink
between size-structured and PP cascade models. By simulating the
PP cascade model with the same parameters (Table 1), standard
deviation as well as typical trajectories of the stochastic model can
be taken as a measure of variability around the prediction by the
size-structured model. We choose the observables used in [23] to
parametrize the model, i.e. the number of metastases exceeding
certain size thresholds c (i.e., fmacro with different thresholds).
Simulation results are shown in Fig. 7.
The average deviation of the data from the size-structured
model prediction is much smaller than the stochastic fluctuation
of the PP cascade model, and we can interpret this from two
different perspectives. On the one hand, since these deviations are
consistent with typical trajectories, the data are in principle explain-
able by stochasticity of emission. On the other hand, since the
range of plausible trajectories is relatively wide using the estimated
model parameters, different sets of model parameters would also be
compatible with the same observed trajectory. To put it differently,
the precision of the parameters of the size-structured model is
probably overestimated since the variability by randomness of emis-
sion is not taken into account.
Parameter estimation is much easier in deterministic than in
stochastic models. In special cases, such as in a PP model without
secondary emission [22], it is possible to estimate model para-
meters in a stochastic model. However, if the statistical model
becomes more complicated, e.g. a mixed-effects model to deal
with population data [41], the computational and even methodo-
logical feasibility limit is quickly reached with a stochastic structural
model [42]. In this case, the crosslink described in Subheading 3.2
[Figure 7: three panels (size threshold $2 \cdot 10^{7}$ cells, $10^{8}$ cells, $10^{9}$ cells); number of macrometastases (vertical axis, 0 to 100) versus time in days (horizontal axis, 1000 to 1400); legend: Expectation, Variability, Data, Stochastic trajectory]
Fig. 7 Comparison of residual variability from the size-structured model fit and stochastic variability of the PP
cascade model. Expectation (bold solid line) is the size-structured model prediction $N_{\mathrm{macro}}(t)$, which was used
to fit the clinical data from [23] (computed via Eq. 10). Variability of the corresponding PP cascade model is
displayed in two ways: through stochastic trajectories (thin lines) and $\mathbb{E}[N_{\mathrm{macro},t}] \pm 2\sqrt{\operatorname{var}[N_{\mathrm{macro},t}]}$, with
$\operatorname{var}[N_{\mathrm{macro},t}]$ computed via Eq. 15 (Variability, bold dashed line). As in [23], we count time from inception of the
first primary tumor cell, which was back-calculated from primary tumor data assuming Gompertzian growth
(hence the first CT scan with metastatic disease is approximately 3 years post-inception)
can be exploited to derive reasonable parameters for simulations

from the PP cascade model by estimating the parameters of its
mean, i.e. the size-structured model.
We applied this reasoning already in Fig. 7, in which the varia-
bility due to stochasticity of emission was discussed. To give an
example using the stochastic nature of the PP cascade model more
explicitly, we now use the stochastic framework to assess the impact
of secondary metastatic emission following surgery of the primary
tumor. Using the clinical parameters stated above, we assume that
the primary tumor is surgically removed 500 days after its incep-
tion, where it has reached a tumor mass of 180 g, and assess the
number of metastases another 500 days later (Fig. 8). Since every
secondary emission is preceded by at least one primary emission,
the probability of metastatic disease is the same in both models
(with and without secondary emission). However, on average a
much larger number of metastases is predicted from the model
with secondary emission ($\mathbb{E}[N_t] = 4.7$ with secondary emission
vs. $\mathbb{E}[N_t] = 1.2$ without).
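The surgery scenario can be explored with a small Monte Carlo sketch of the PP cascade. This is not the simulation code behind Fig. 8: it assumes that metastases grow and emit with the same Gompertz and power laws as the primary tumor (Table 1), starting from one cell, and it samples emission times by thinning, a standard technique for non-homogeneous Poisson processes that is not described in this chapter. All function names are hypothetical.

```python
import math
import random

A_GOMP, XP1 = 0.00286, 7.3e10   # Gompertz growth rate (1/day) and maximum size (cells), Table 1
B, BETA = 5.3e-8, 0.663         # emission rate constant and emission power, Table 1


def intensity(age):
    """Emission intensity of a tumor of the given age (days); increasing in age."""
    size = XP1 * (1.0 / XP1) ** math.exp(-A_GOMP * age)
    return B * size ** BETA


def emission_ages(horizon, rng):
    """Ages in [0, horizon] at which one tumor emits, sampled by thinning."""
    lam_max = intensity(horizon)          # valid bound because intensity is increasing
    ages, t = [], 0.0
    while True:
        t += rng.expovariate(lam_max)     # candidate from a homogeneous PP
        if t > horizon:
            return ages
        if rng.random() * lam_max <= intensity(t):
            ages.append(t)                # accepted with probability intensity(t)/lam_max


def count_metastases(t_surgery, t_obs, secondary, rng):
    """Number of metastases at t_obs when the primary tumor is removed at t_surgery."""
    births = emission_ages(min(t_surgery, t_obs), rng)   # first generation
    total = 0
    while births:
        birth = births.pop()
        total += 1
        if secondary:
            # each metastasis starts its own independent PP at its inception time
            births.extend(birth + age for age in emission_ages(t_obs - birth, rng))
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    n_sim = 10000
    for secondary in (True, False):
        counts = [count_metastases(500.0, 1000.0, secondary, rng) for _ in range(n_sim)]
        print("secondary emission:", secondary,
              "| mean number:", sum(counts) / n_sim,
              "| P(no metastasis):", counts.count(0) / n_sim)
```

Since a secondary emission requires at least one primary emission, the estimated probability of no metastasis should agree between the two runs up to sampling noise, while the mean number differs markedly.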
4 Summary and Outlook
This chapter focuses on Poisson processes as possible frameworks

describing metastatic emission, usable to predict metastatic risk or
micrometastatic dynamics. Although representing randomness in a
[Figure 8: two bar charts, top panel “With secondary metastatic emission”, bottom panel “Without secondary metastatic emission”; probability in % (vertical axis, 0 to 40) versus number of metastases (horizontal axis, 0 to 8)]
Fig. 8 Probability of metastatic disease after surgery with (top panel) or without (bottom panel) secondary
metastatic emission (each based on 10,000 simulations). In addition to the clinical parameters derived in [23],
it is assumed that the primary tumor is surgically removed 500 days after its inception, and that the number of
metastases is evaluated another 500 days later
relatively simple way, PPs have appealing properties that have been
illustrated here. They can be easily included as building blocks in
larger models, which has been shown with the PP cascade model,
but which applies in a much more general way. Also, they allow for a
high degree of analytical tractability, which was exploited here to
characterize the mean behavior of the PP cascade model.
Without doubt, further improvements of these techniques are
required. In particular, to make individualized risk predictions with
the model we have to match patient characteristics to model para-
meters. In this respect, circulating biomarkers, such as circulating
tumor cells or circulating tumor DNA can be a useful source of
information, especially since quantification methods are rapidly
getting more reliable [43, 44, 2]. Both frameworks presented
here allow for such an extension. Once validated, a mathematical
model can serve as a powerful tool for informed treatment decisions
for cancer patients by integrating case-specific information into a
consistent quantitative framework.
While this chapter has focused on the natural metastatic emis-
sion kinetics, it is possible to extend the formalism to cover systemic
treatments such as chemotherapy (represented as a size function
Xm(t; tincept) depending on inception time in the stochastic context
of Subheading 2.2, or a time-varying growth rate g(x, t) in the
deterministic context of Subheading 3.1). However, although
both the deterministic and stochastic settings can be extended in

this way, a direct link between these two extended frameworks (as in
Subheading 3.2) has not been established yet.
5 Mathematical Formalism and Results
This section is devoted to the mathematical formalism and the

derivations of the results of Subheadings 2 and 3. We provide
here precise definitions of the mathematical objects and rigorous
computations and results. We start by introducing the
non-homogeneous Poisson process and derive formula (Eq. 1)
from the Poisson assumption. Next, we summarize key results for
the size-structured model, we introduce the probabilistic frame-
work for secondary emissions, and then derive rigorously the link
between these two models.
Throughout this section ðΩ, F , ℙÞ is a probability space on
which all the random variables we consider are defined.
5.1 Definition of a Poisson Process and Basic Properties
The Poisson distribution is a standard way to count the occurrences
of some events.
Definition 5.1 (Poisson Distribution). Let $\mu \ge 0$. A random variable
$Y$ with values in $\mathbb{N}$ is said to have a Poisson distribution with
parameter $\mu$, that we denote by $Y \sim \mathcal{P}(\mu)$, if for all $k \in \mathbb{N}$
\[ \mathbb{P}(Y = k) = e^{-\mu}\,\frac{\mu^k}{k!} \]
for $\mu > 0$, and if $\mathbb{P}(Y = 0) = 1$ in the case $\mu = 0$.
The parameter $\mu \in \mathbb{R}_+$ can be interpreted as the expected number
of occurrences since
\[ \mathbb{E}[Y] = \mu \quad \text{with } Y \sim \mathcal{P}(\mu). \]
In our context, it counts the number of metastases. However, at
this level, we have no information on the event times we are
counting, nor how this number evolves with respect to time. To
handle the random nature of these times, let us introduce the
Poisson processes.
Definition 5.2 (Non-homogeneous Poisson Process). Let
$\lambda : \mathbb{R}_+ \to \mathbb{R}_+$ be a continuous function. We say that $(N_t)_{t \ge 0}$ is a
non-homogeneous Poisson process with intensity $\lambda$ if:
1. $N_0 = 0$;
2. the numbers of occurrences in disjoint time intervals are independent,
i.e. for $t_0 < \ldots < t_n$, the random variables $N_{t_k} - N_{t_{k-1}}$, $k = 1, \ldots, n$, are independent;
3. For all $t > 0$, $N_t$ has a Poisson distribution with parameter $\Lambda(t)$,
given by
\[ \Lambda(t) = \int_0^t \lambda(u)\,du. \]
The terminology non-homogeneous results from the fact that the
intensity function λ can vary in time, as opposed to a homogeneous
Poisson process for which λ is constant. Also, there are several
equivalent definitions for a non-homogeneous Poisson process.
For instance, the third item above can be replaced by the following
properties:
\[ \mathbb{P}(N_{t+\Delta t} - N_t = 1) = \lambda(t)\,\Delta t + o(\Delta t) \]
and
\[ \mathbb{P}(N_{t+\Delta t} - N_t \ge 2) = o(\Delta t), \]
where $o(\Delta t)$ stands for a function satisfying $o(\Delta t)/\Delta t \to 0$ as
$\Delta t \to 0$. In the previous definition, we chose $\lambda$ as a continuous
function since we would not expect any discontinuities in rate of
metastatic emission. Nevertheless, mathematically this assumption
can be relaxed to more general nonnegative functions.
Finally, from a Poisson process $(N_t)_{t \ge 0}$, one can define the
event times recursively as
\[ T^{(k)} = \inf\{t > T^{(k-1)} : N_t = N_{t^-} + 1\} \quad \text{for } k = 1, 2, \ldots \]
with $T^{(0)} = 0$. We refer to Fig. 1 for an illustration of the relation
between $(N_t)_{t \ge 0}$ and the event times. From these times, one can
consider the following (random) measure on $\mathbb{R}_+$:
\[ P(du) := \sum_{k=1}^{+\infty} \delta_{T^{(k)}}(du), \]
where $\delta_x$ stands for the Dirac distribution at point $x$. This measure is
called the Poisson random measure associated to $(N_t)_{t \ge 0}$. From
this definition of $P$, it is straightforward to see that for any $t \ge 0$,
\[ N_t = P([0, t]) = \sum_{k=1}^{+\infty} \mathbf{1}_{(T^{(k)} \le t)}. \]
Here, $\mathbf{1}_A$ is the indicator function of $A$, that is, it takes the value 1 if
$A$ is true and 0 otherwise. In the same way, the total metastatic
biomass (Eq. 8), for instance, can be rewritten as
\[ M_t = \sum_{k=1}^{+\infty} \mathbf{1}_{(T^{(k)} \le t)}\, X_m\big(t - T^{(k)}\big) = \int_0^t X_m(t - u)\,P(du). \]
Expressions involving an integral with respect to the Poisson measure
allow convenient manipulations as we will see in the Appendix
using Proposition A.1.
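To make the relation between the event times and the counting process concrete, the following sketch samples the event times of a non-homogeneous Poisson process by thinning and evaluates $N_t$ from them. Thinning is a standard simulation technique and is not taken from this chapter; the function names, the bound `lam_max`, and the linear example intensity are assumptions for illustration.

```python
import random


def simulate_event_times(intensity, lam_max, t_end, rng):
    """Event times T(1) < T(2) < ... in [0, t_end] of a PP with the given intensity.

    Thinning: candidates come from a homogeneous PP with rate lam_max and are
    kept with probability intensity(t) / lam_max. Requires intensity(t) <= lam_max
    on [0, t_end].
    """
    times, t = [], 0.0
    while True:
        t += rng.expovariate(lam_max)
        if t > t_end:
            return times
        if rng.random() * lam_max <= intensity(t):
            times.append(t)


def counting_process(times, t):
    """N_t: number of event times that do not exceed t."""
    return sum(1 for tk in times if tk <= t)


if __name__ == "__main__":
    rng = random.Random(0)
    # Example intensity lambda(t) = 2t on [0, 3], so that Lambda(3) = 9.
    runs = [simulate_event_times(lambda t: 2.0 * t, 6.0, 3.0, rng) for _ in range(5000)]
    mean_count = sum(counting_process(ts, 3.0) for ts in runs) / len(runs)
    print("empirical E[N_3]:", mean_count, "(Lambda(3) = 9)")
```

The empirical mean of $N_t$ over many runs should approach $\Lambda(t)$, in line with the third item of Definition 5.2.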
5.2 Derivation of Empirical Formula from Poisson Assumptions
Let us assume that the primary tumor diameter $d(t)$ follows a power
law:
\[ d'(t) = a\,d(t)^{\alpha}, \quad d(0) = d_0, \quad 0 < \alpha < 1. \]
Power growth of volume with a power between 2/3 and 1 has been
described in the literature, leading to the above model if we assume
a spherical shape of the tumor.
Furthermore, let us assume that metastatic emission is governed
by a Poisson process with intensity $\lambda(t) = b\,d(t)^{\beta}$. We will
require $\beta > 0$, since the emission rate should increase with primary
tumor size. Then, the number of metastases $N_t$ by time $t$ is Poisson
distributed with parameter $\Lambda(t) = \int_0^t \lambda(s)\,ds$ and the probability of
metastasis-free disease at time $t$ is given by
\[ \mathbb{P}(\text{no metastases}) = \mathbb{P}(N_t = 0) = \exp\left(-b \int_0^t d(s)^{\beta}\,ds\right). \]
Here, we have
\[ b \int_0^t d(s)^{\beta}\,ds = \frac{b}{a} \int_0^t d'(s)\,d(s)^{\beta-\alpha}\,ds = \frac{b}{a} \int_0^t \frac{d}{ds}\left(\frac{d(s)^{\beta-\alpha+1}}{\beta-\alpha+1}\right) ds = \frac{b}{a(\beta-\alpha+1)}\left(d(t)^{\beta-\alpha+1} - d_0^{\beta-\alpha+1}\right), \]
and assuming that the tumor is initiated with a negligible size ($d_0 \ll d(t)$), one obtains
\[ b \int_0^t d(s)^{\beta}\,ds \approx \frac{b}{a(\beta-\alpha+1)}\,d(t)^{\beta-\alpha+1}. \]
This then yields the empirical formula
\[ \mathbb{P}(\text{no metastases}) = \exp\left(-c\,d(t)^{z}\right), \]
with $c = \frac{b}{a(\beta-\alpha+1)}$ and $z = \beta - \alpha + 1$. Since $\beta > 0$ and $\alpha < 1$, we have
$z > 0$, and the above manipulation is justified.
It should be noted that although c and z can be determined
unambiguously from information on metastatic status and primary
tumor size at surgery if the patient cohort is large enough, this does
not apply for the growth and emission parameters of the underlying
Poisson process. To distinguish the growth and emission processes
additional information is required, such as repeated tumor size
measurements over time.
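The closed-form expression can be checked numerically. The sketch below compares it with a trapezoidal approximation of $b \int_0^t d(s)^{\beta}\,ds$; the explicit solution $d(t) = (d_0^{1-\alpha} + a(1-\alpha)t)^{1/(1-\alpha)}$ of the power law follows from separation of variables, while the parameter values and function names are arbitrary choices for illustration and are not taken from the chapter.

```python
import math


def diameter(t, a, alpha, d0):
    """Solution of d'(t) = a * d(t)**alpha, d(0) = d0, for 0 < alpha < 1."""
    return (d0 ** (1.0 - alpha) + a * (1.0 - alpha) * t) ** (1.0 / (1.0 - alpha))


def cumulative_intensity_numeric(t, a, alpha, b, beta, d0, n_steps=20000):
    """Trapezoidal approximation of b * integral_0^t d(s)**beta ds."""
    h = t / n_steps
    total = 0.5 * (diameter(0.0, a, alpha, d0) ** beta + diameter(t, a, alpha, d0) ** beta)
    for i in range(1, n_steps):
        total += diameter(i * h, a, alpha, d0) ** beta
    return b * h * total


def cumulative_intensity_closed(t, a, alpha, b, beta, d0):
    """c * (d(t)**z - d0**z) with c = b / (a*z) and z = beta - alpha + 1."""
    z = beta - alpha + 1.0
    c = b / (a * z)
    return c * (diameter(t, a, alpha, d0) ** z - d0 ** z)


if __name__ == "__main__":
    params = dict(a=0.1, alpha=0.5, b=0.01, beta=2.0, d0=0.01)   # illustrative values only
    numeric = cumulative_intensity_numeric(50.0, **params)
    closed = cumulative_intensity_closed(50.0, **params)
    print("numeric:", numeric, "closed form:", closed)
    print("P(no metastases):", math.exp(-closed))
```

With a small initial diameter the term $d_0^{z}$ is negligible, which is the approximation leading to the empirical formula.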
5.3 Summary of Key Results on the Size-Structured Model
The framework proposed in [23] focuses on the evolution of a size-
structured metastatic density $\rho$. Originally, it was assumed that
primary and secondary tumors have the same growth and emission
patterns. Here, we present an extended version, described, e.g., in
[40], where primary and secondary growth and emission dynamics
can be different.
As before, $X_p$ denotes the size of the primary tumor and $\gamma_p$ the
primary tumor emission law. The size of a metastasis is given by $X_m$,
which is the solution of an autonomous ordinary differential
equation
\[ X_m'(t) = g(X_m(t)), \quad X_m(0) = x_m^0. \]
The emission law of the metastases is $\gamma_m$. Then, the metastatic
density function $\rho$ solves the following equation:
\[
\begin{cases}
\partial_t \rho(x, t) + \partial_x [g(x)\rho(x, t)] = 0, & (x, t) \in (x_m^0, x_m^1) \times (0, +\infty), \\
g(x_m^0)\,\rho(x_m^0, t) = \gamma_p(X_p(t)) + \int_{x_m^0}^{x_m^1} \gamma_m(x)\,\rho(x, t)\,dx, & t \in (0, +\infty), \\
\rho(x, 0) = 0, & x \in [x_m^0, x_m^1].
\end{cases}
\tag{9}
\]
We also introduce the emission rates of the primary tumor and the
metastases, respectively:
\[ \lambda_p(t) := \gamma_p(X_p(t)) \quad \text{and} \quad \lambda_m(t) := \gamma_m(X_m(t)). \]
Existence of a unique weak solution $\rho$ to this model has been
shown under general conditions in [39]. For the purpose of this
chapter, it is sufficient to assume that $\lambda_p$, $\gamma_m$, and $g$ are continuously
differentiable nonnegative functions, and that $\lim_{t \to +\infty} \lambda_p(t) < +\infty$
and $g(x_m^1) = 0$.
Model observables for the size-structured model have been
introduced in Eq. 6. As shown in [40], they can be characterized
as the solutions of a Volterra convolution equation:

Theorem 5.1. For any $f \in L^1([x_m^0, x_m^1])$, $\mathrm{MO}_f$ is the unique solution
of the following renewal equation:
\[ \mathrm{MO}_f(t) = \int_0^t \lambda_p(s)\,f(X_m(t - s))\,ds + \int_0^t \lambda_m(s)\,\mathrm{MO}_f(t - s)\,ds. \tag{10} \]
This alternative formulation will be important to bridge the gap
between the stochastic and deterministic frameworks. Of note, it is
also the basis of an efficient numerical resolution algorithm [40].
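A simple way to see Eq. 10 at work is to discretize it on a uniform time grid. The sketch below uses a plain trapezoidal scheme for $f = 1$ (the expected number of metastases); it is not the resolution algorithm of [40]. It assumes identical Gompertz growth and power-law emission for primary tumor and metastases (Table 1, one-cell initial size) and represents surgery by setting $\lambda_p$ to zero after `t_surgery`, which is assumed to lie on the grid. All names are hypothetical.

```python
import math

A_GOMP, XP1 = 0.00286, 7.3e10   # Gompertz growth rate (1/day) and maximum size (cells), Table 1
B, BETA = 5.3e-8, 0.663         # emission rate constant and emission power, Table 1


def intensity(age):
    """lambda(age) = b * X(age)**beta, assumed identical for primary tumor and metastases."""
    size = XP1 * (1.0 / XP1) ** math.exp(-A_GOMP * age)
    return B * size ** BETA


def expected_number(t_end, t_surgery, h=1.0):
    """Solve Eq. 10 with f = 1 on the grid t_i = i*h.

    Returns (forcing, mo): forcing[i] is the first integral of Eq. 10, i.e. the
    expected number without secondary emission; mo[i] approximates MO_1(t_i).
    """
    n = int(round(t_end / h)) + 1
    lam = [intensity(i * h) for i in range(n)]
    forcing = [0.0] * n
    for i in range(1, n):
        # cumulative trapezoid of lambda_p, which vanishes after surgery
        step = 0.5 * h * (lam[i - 1] + lam[i]) if i * h <= t_surgery else 0.0
        forcing[i] = forcing[i - 1] + step
    mo = [0.0] * n
    denom = 1.0 - 0.5 * h * lam[0]   # implicit end-point term of the trapezoidal rule
    for i in range(1, n):
        # the other end-point term, 0.5 * lam[i] * mo[0], vanishes since mo[0] = 0
        conv = sum(lam[j] * mo[i - j] for j in range(1, i))
        mo[i] = (forcing[i] + h * conv) / denom
    return forcing, mo


if __name__ == "__main__":
    forcing, mo = expected_number(1000.0, 500.0)
    print("without secondary emission:", forcing[-1])
    print("with secondary emission:   ", mo[-1])
```

The quadratic cost of the convolution sum is acceptable for a sketch of this size; the first returned list gives the prediction without secondary emission for comparison.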
5.4 Probabilistic Framework for Secondary Emission
Let us first remind the reader that we assume all emissions of the
primary tumor and of the metastases to be independent. To exploit
this property in calculations, we have to take care of the filiation of
each metastasis, i.e. the generational hierarchy (the primary tumor,
the metastases emitted from the primary tumor, the metastases
emitted from the metastases emitted from the primary tumor,
etc.). We will therefore introduce a cascade of independent PPs,
and define recursively the emission times with respect to the gener-
ational hierarchy.
• The emission times for the first generation of metastases, that is,
the ones emitted by the primary tumor, are the event times of a
PP $(N_t^{(1)})_{t \ge 0}$ with intensity $\lambda_p$; we will write $\Pi^{(1)} := (T^{(j)})_{j \ge 1}$
for the set of random emission times.
The emission times for the next generations of metastases are
defined recursively.
• Let $k \ge 2$ and $n_1, \ldots, n_{k-1} \ge 1$. The $j$th emission time for the $k$th
generation of metastasis with filiation $n_1, \ldots, n_{k-1}$ is defined by
\[ T^{(n_1, \ldots, n_{k-1}, j)} := T^{(n_1, \ldots, n_{k-1})} + \tilde{T}^{(n_1, \ldots, n_{k-1}, j)}. \tag{11} \]
The term $\tilde{T}^{(n_1, \ldots, n_{k-1}, j)}$ is the time it takes for the offspring with filiation
$n_1, \ldots, n_{k-1}$ to give birth to its $j$th offspring. Here, the family
$(\tilde{T}^{(n_1, \ldots, n_{k-1}, j)})_{j \ge 1}$ is formed by the event times of a PP
$(N_t^{(n_1, \ldots, n_{k-1})})_{t \ge 0}$ with intensity $\lambda_m$.
We refer to Fig. 4 for an illustration of these emission times.
For instance, $T^{(2,3,4)}$ is the inception time of the fourth off-
spring produced by the third offspring of the second offspring of
the primary tumor. Using biologically relevant parameters, the
expected emission times for all but the first few generations are
larger than any reasonable observation timeframe. However, even if
the contribution of late generations is very small, we need to
consider the whole cascade of emission times to bridge the gap to
the size-structured model.
Finally, assuming that
\[ \left\{ \big(N_t^{(n_1, \ldots, n_k)}\big)_{t \ge 0},\ k \ge 1,\ n_1, \ldots, n_k \ge 1 \right\} \]
is a family of independent PPs implies the biological assumption we
made, which is that the primary tumor and all the metastases emit
independently from each other.
In the PP model without secondary emission, model observa-
bles were defined in Eq. 7. With the PP cascade defined above, we
are now able to extend this concept to secondary emission con-
structively. For f ∈ L1([xm0, xm1]), the SMO for the kth genera-
tion can be expressed by
ðkÞ
X
SMOf ðtÞ :¼ 1ðT ðn1 , ..., nk Þ t Þ f ðX m ðt T ðn1 , ..., nk Þ ÞÞ, ð12Þ
n1 , ..., nk 1
and we have the following definition.

Definition 5.3 (Model Observables with Secondary Emission).
The SMOs with secondary emission are given by
$$\mathrm{SMO}_f(t) := \sum_{k=1}^{+\infty} \mathrm{SMO}_f^{(k)}(t). \tag{13}$$
In this definition, $\mathrm{SMO}_f^{(k)}$ describes the contribution of the $k$th
generation to the SMO.
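To make Eqs. 12 and 13 concrete, the sketch below evaluates the SMO for a given set of emission times. It is an illustration only: the emission times are made up, the Gompertz growth law and its parameters are assumed stand-ins for $X_m$, and the function names are hypothetical.

```python
import math

def gompertz(s, x0=1.0, a=0.5, b=0.05):
    """Assumed growth law X_m(s) for a metastasis of age s (hypothetical parameters)."""
    return x0 * math.exp(a / b * (1.0 - math.exp(-b * s)))

def smo_generation(emission, t, f, k, growth=gompertz):
    """Eq. 12: contribution of generation k, i.e. filiations of length k."""
    return sum(f(growth(t - T)) for n, T in emission.items()
               if len(n) == k and T <= t)

def smo(emission, t, f, growth=gompertz):
    """Eq. 13: sum over all generations present in `emission`."""
    generations = {len(n) for n in emission}
    return sum(smo_generation(emission, t, f, k, growth) for k in generations)

# Made-up emission times: two first-generation and one second-generation metastasis.
emission = {(1,): 1.0, (2,): 3.0, (1, 1): 2.5}

count = smo(emission, 2.7, lambda x: 1.0)                          # all metastases
visible = smo(emission, 2.7, lambda x: 1.0 if x >= 2.0 else 0.0)   # size >= 2 only
```

Choosing $f \equiv 1$ counts all metastases present at time $t$, while an indicator of a size threshold counts only those above that size; other choices of $f$ give other observables such as total metastatic burden.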
The following proposition links the MOs from the stochastic
and deterministic frameworks.

Proposition 5.1 (Link to the Model Observables). Let
$f \in L^1([x_m^0, x_m^1])$. The SMO (Eq. 13) is well defined in the sense
that
$$\mathbb{P}(\forall t \ge 0,\ 0 \le \mathrm{SMO}_f(t) < +\infty) = 1.$$
Moreover, the expected value
$$e_f(t) := \mathbb{E}[\mathrm{SMO}_f(t)]$$
is finite for all $t \ge 0$ and satisfies (Eq. 10), so that
$$e_f(t) = \mathrm{MO}_f(t).$$
Let us remark that the SMO (Eq. 13) may also be seen as an
integral w.r.t. a random measure,
$$\mathrm{SMO}_f(t) = \int f(x)\, M_t(dx),$$
for any $t \ge 0$, with
$$M_t := \sum_{k \ge 1} \sum_{n_1, \dots, n_k \ge 1} \delta_{X_m(t - T^{(n_1, \dots, n_k)})}. \tag{14}$$
This description is the key point to bridge the gap to the description
of metastasis via a structured population equation.
Theorem 5.2 (Link to the Structured Population Model). For all
$t \ge 0$, the measure
$$\mu_t := \mathbb{E}[M_t]$$
is σ-finite, absolutely continuous with respect to the Lebesgue
measure, and its Radon–Nikodým density is given by $\rho(\cdot, t)$,
$$\frac{d\mu_t}{dx} = \rho(\cdot, t),$$
where ρ is the solution of the structured population
equation (Eq. 9).
The last result we present here concerns the variability of the
SMO (Eq. 13) with respect to its mean $\mathrm{MO}_f$.

Proposition 5.2 (Variance of Observables). Let $f \in L^1([x_m^0, x_m^1])$.
The variance of the SMO
$$v_f(t) := \operatorname{var}[\mathrm{SMO}_f(t)]$$
is finite for any $t \ge 0$, and satisfies a renewal equation:
$$v_f(t) = \int_0^t \lambda_p(s)\,\bigl(f(X_m(t - s)) + e_{m,f}(t - s)\bigr)^2\,ds + \int_0^t \lambda_m(s)\, v_f(t - s)\,ds. \tag{15}$$
Here,
$$e_{m,f}(t) := \mathbb{E}[\mathrm{SMO}_{m,f}(t)],$$
where $\mathrm{SMO}_{m,f}$ is defined as in (Eq. 13), but for a different cascade
of PPs, which has only $\lambda_m$ for intensity (both for the first and
subsequent generations).
This result is of great interest to design confidence intervals as
illustrated in Subheading 3.3. In fact, the renewal equation (Eq. 15)
allows the use of an efficient numerical resolution algorithm [40].
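As a rough illustration of how renewal equations of this type can be solved numerically, the sketch below discretizes the convolution with a simple rectangle rule on a uniform grid. This is not the algorithm of [40]; the function names, the constant intensities, and the choice $f \equiv 1$ (so that the SMO counts metastases) are assumptions made for the example.

```python
import math

def convolve(lam, y, h):
    """Rectangle rule for c_i ~ int_0^{t_i} lam(s) y(t_i - s) ds on t_i = i*h."""
    return [h * sum(lam[j] * y[i - j] for j in range(1, i + 1))
            for i in range(len(y))]

def solve_renewal(g, lam, h):
    """Solve y(t) = g(t) + int_0^t lam(s) y(t - s) ds explicitly on the grid."""
    y = []
    for i in range(len(g)):
        # Only already-computed values y[0..i-1] enter the sum.
        y.append(g[i] + h * sum(lam[j] * y[i - j] for j in range(1, i + 1)))
    return y

# Assumed example: constant intensities and f == 1.
lam_p, lam_m, T, n = 0.5, 0.2, 5.0, 1000
h = T / n
lp, lm = [lam_p] * (n + 1), [lam_m] * (n + 1)
ones = [1.0] * (n + 1)

# e_{m,f} = lam_m * (f(X_m) + e_{m,f})
e_m = solve_renewal(convolve(lm, ones, h), lm, h)
# e_f = lam_p * (f(X_m) + e_{m,f})
e_f = convolve(lp, [1.0 + e for e in e_m], h)
# First integral of Eq. 15, then the renewal equation for v_f.
source = convolve(lp, [(1.0 + e) ** 2 for e in e_m], h)
v_f = solve_renewal(source, lm, h)
```

For this constant-rate case with $f \equiv 1$ the equations can be solved by hand, which gives a check on the discretization: $e_{m,f}(t) = e^{\lambda_m t} - 1$, $e_f(t) = (\lambda_p/\lambda_m)(e^{\lambda_m t} - 1)$, and $v_f(t) = (\lambda_p/\lambda_m)(e^{2\lambda_m t} - e^{\lambda_m t})$. The scheme is only first-order accurate and costs $O(n^2)$ operations, so it is meant as a reading aid rather than an efficient solver.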
6 Notes
1. The interpretation of β depends on the unit of the primary
tumor measure. As an example, a surficial vascularization
would correspond to β = 2 if size is measured in diameter, but
to β = 2/3 (the fractal dimension of a surface in space) if size is
measured in volume.
2. We remind the reader that a surgery at time $t_{\text{surgery}}$ is represented
by setting the emission intensity λ to zero for all times larger
than $t_{\text{surgery}}$.
Acknowledgements
The authors thank Florence Hubert, Charlotte Kloft, and Andrea

Henrich for suggestions and critical reading of the manuscript. NH
gratefully acknowledges financial support by the Agence Nationale
de la Recherche under grant ANR-09-BLAN-0217-01.
A Appendix: Proofs of Results of Section 5.4
The proofs provided in this section are based on the following
classical result on Poisson random measures. We refer to [45,
Chap. 6, pp. 251] for further details. Also, note that this result
directly yields Eqs. 2 through 5.

Proposition A.1. Let $(N_t)_{t \ge 0}$ be a PP with intensity λ and $P$ the
corresponding Poisson random measure. We have for
$\phi, \psi \in L^1(\mathbb{R}_+, \lambda(u)\,du) \cap L^2(\mathbb{R}_+, \lambda(u)\,du)$
$$\mathbb{E}\left[\int \phi(u)\,P(du)\right] = \int \phi(u)\lambda(u)\,du,$$
and
$$\mathbb{E}\left[\int \phi(u_1)\,P(du_1) \int \psi(u_2)\,P(du_2)\right] = \int \phi(u_1)\lambda(u_1)\,du_1 \int \psi(u_2)\lambda(u_2)\,du_2 + \int \phi(u)\psi(u)\lambda(u)\,du.$$
In other words, we can write the first order moment of the
Poisson random measure $P$ in a more compact form
$$\mathbb{E}[P(du)] = \lambda(u)\,du,$$
as well as its second order moment
$$\mathbb{E}[P(du_1)P(du_2)] = \lambda(u_1)\lambda(u_2)\,du_1\,du_2 + \delta(u_1 - u_2)\lambda(u_1)\,du_1\,du_2.$$
Moreover, to simplify notations in the forthcoming computations,
we introduce the following convolution-like notation: for functions
$\phi, \psi$,
$$\phi * \psi(t) := \int_0^t \phi(t - u)\psi(u)\,du. \tag{16}$$
A.1 Proof of Proposition 5.1
We first need to establish the following lemma, which is proven
further below.

Lemma A.1. We have
$$e_f = \lambda_p * (f(X_m) + e_{m,f}), \tag{17}$$
where $e_{m,f}$ has been introduced in Proposition 5.2.

This is not exactly the renewal equation we want. To derive the
desired equation (Eq. 10) we just have to make the following
remark. Taking $\lambda_p = \lambda_m$, Lemma A.1 gives that $e_{m,f}$ satisfies
$$e_{m,f} = \lambda_m * (f(X_m) + e_{m,f}).$$
Hence, from (Eq. 17), we have
$$\begin{aligned}
e_f &= \lambda_p * f(X_m) + \lambda_p * (\lambda_m * f(X_m) + \lambda_m * e_{m,f}) \\
&= \lambda_p * f(X_m) + \lambda_m * (\lambda_p * f(X_m) + \lambda_p * e_{m,f}) \\
&= \lambda_p * f(X_m) + \lambda_m * e_f.
\end{aligned} \tag{18}$$
Now, let $T > 0$; we have from the last line of (Eq. 18) that for all $t \in [0, T]$,
$$e_f(t) \le \|f\|_{L^1([x_m^0, x_m^1])}\, \|\lambda_p\|_{L^1([0, T])}\, T + \|\lambda_m\|_{L^1([0, T])} \int_0^t e_f(u)\,du,$$
which gives, using Gronwall’s inequality,
$$\sup_{t \in [0, T]} e_f(t) \le C_{T, f, \lambda_p, \lambda_m}\, \|f\|_{L^1([x_m^0, x_m^1])} < +\infty. \tag{19}$$
As a result, $e_f(t) < +\infty$ for all $t \ge 0$ since $T$ is arbitrary, and also
$$\mathbb{P}(\mathrm{SMO}_f(T) < +\infty) = 1.$$
Finally, using that $t \mapsto \mathrm{SMO}_f(t)$ is an increasing non-negative
function, we have
$$\mathbb{P}(\forall t \in [0, T],\ \mathrm{SMO}_f(t) < +\infty) = 1,$$
and then
$$\mathbb{P}(\forall t \ge 0,\ \mathrm{SMO}_f(t) < +\infty) = \lim_{n \to +\infty} \mathbb{P}(\forall t \in [0, n],\ \mathrm{SMO}_f(t) < +\infty) = 1.$$
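Proof (of Lemma A.1). Let us start with the following remark.
According to the recursive definition (Eq. 11) of our PP cascade, one
has
$$T^{(n_1, \dots, n_k)} = T^{(n_1)} + \tilde{T}^{(n_1, \dots, n_k)}, \tag{20}$$
where all the times
$$\{\tilde{T}^{(n_1, \dots, n_k)},\ k \ge 2,\ n_1, \dots, n_k \ge 1\}$$
are independent of $\Pi^{(1)} := (T^{(n_1)})_{n_1 \ge 1}$.
Now, from this consideration, by taking apart the first generation of
metastasis, we can rewrite $\mathrm{SMO}_f$ as follows:
$$\begin{aligned}
\mathrm{SMO}_f(t) &= \sum_{n_1 \ge 1} \mathbf{1}_{(T^{(n_1)} \le t)}\, f(X_m(t - T^{(n_1)})) \\
&\quad + \sum_{n_1 \ge 1} \mathbf{1}_{(T^{(n_1)} \le t)}\, \mathrm{SMO}_{n_1, f}(t - T^{(n_1)}) \\
&:= I + J,
\end{aligned} \tag{21}$$
with
$$\mathrm{SMO}_{n_1, f}(t) := \sum_{k \ge 2} \sum_{n_2, \dots, n_k \ge 1} \mathbf{1}_{(\tilde{T}^{(n_1, \dots, n_k)} \le t)}\, f(X_m(t - \tilde{T}^{(n_1, \dots, n_k)})),$$
which are independent of $\Pi^{(1)}$. Note that all the times
$$\Pi_{n_1} := \{\tilde{T}^{(n_1, \dots, n_k)},\ k \ge 2,\ n_2, \dots, n_k \ge 1\}$$
can be defined following (Eq. 11), but with $\lambda_m$ as intensity for all the
PPs since we consider all the times from the second generation. Therefore,
$(\mathrm{SMO}_{n_1, f})_{n_1 \ge 1}$ are all independent. Moreover, the shape of all the
$\mathrm{SMO}_{n_1, f}$ is similar to $\mathrm{SMO}_f$ except that the PPs in the cascade all have
$\lambda_m$ for intensity. Hence, all the $\mathrm{SMO}_{n_1, f}$ have the same law as $\mathrm{SMO}_{m, f}$.
Using Proposition A.1 with the Poisson random measure $P^{(1)}(du)$
associated to $(N_t^{(1)})_{t \ge 0}$ (with intensity $\lambda_p$), it is direct to see that
$$\mathbb{E}[I] = \mathbb{E}\left[\int_0^t f(X_m(t - u))\,P^{(1)}(du)\right] = \int_0^t \lambda_p(u) f(X_m(t - u))\,du = \lambda_p * f(X_m)(t).$$
For the second term, using standard properties of the conditional
expectation (especially $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$), one has
$$\mathbb{E}[J] = \mathbb{E}\left[\sum_{n_1 \ge 1} \mathbf{1}_{(T^{(n_1)} \le t)}\, \mathbb{E}\bigl[\mathrm{SMO}_{n_1, f}(t - T^{(n_1)}) \mid \Pi^{(1)}\bigr]\right]$$
with
$$\begin{aligned}
\mathbb{E}\bigl[\mathrm{SMO}_{n_1, f}(t - T^{(n_1)}) \mid \Pi^{(1)}\bigr] &= \mathbb{E}[\mathrm{SMO}_{n_1, f}(t - u)]\big|_{u = T^{(n_1)}} \\
&= \mathbb{E}[\mathrm{SMO}_{m, f}(t - u)]\big|_{u = T^{(n_1)}} \\
&= e_{m, f}(t - T^{(n_1)}),
\end{aligned} \tag{22}$$
and then
$$\sum_{n_1 \ge 1} \mathbf{1}_{(T^{(n_1)} \le t)}\, \mathbb{E}\bigl[\mathrm{SMO}_{n_1, f}(t - T^{(n_1)}) \mid \Pi^{(1)}\bigr] = \int_0^t e_{m, f}(t - u)\,P^{(1)}(du).$$
This, together with Proposition A.1, yields
$$\mathbb{E}[J] = \mathbb{E}\left[\int_0^t e_{m, f}(t - u)\,P^{(1)}(du)\right] = \int_0^t \lambda_p(u)\, e_{m, f}(t - u)\,du = \lambda_p * e_{m, f}(t),$$
which concludes the proof of (Eq. 17). □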
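A.2 Proof of Proposition 5.2
Using the same strategy as for (Eq. 18), the proof of (Eq. 15)
consists only in proving the following relation:
$$v_f = \lambda_p * (f(X_m) + e_{m,f})^2 + \lambda_p * v_{m,f}, \tag{23}$$
with $v_{m,f}(t) := \operatorname{var}[\mathrm{SMO}_{m,f}(t)]$. Knowing Proposition 5.1 and the
formula of the variance, one can focus on the term
$e_f^{(2)}(t) := \mathbb{E}[\mathrm{SMO}_f^2(t)]$. Using (Eq. 21), we have to compute three
terms:
$$e_f^{(2)}(t) = \mathbb{E}[I^2] + 2\,\mathbb{E}[IJ] + \mathbb{E}[J^2].$$
The Term $\mathbb{E}[I^2]$: Recalling that $I = \int_0^t f(X_m(t - u))\,P^{(1)}(du)$, and
using Proposition A.1, it is direct that
$$\mathbb{E}[I^2] = \left(\int_0^t \lambda_p(u) f(X_m(t - u))\,du\right)^2 + \int_0^t \lambda_p(u) f^2(X_m(t - u))\,du.$$
The Term $\mathbb{E}[IJ]$: Using standard properties of the conditional expectation,
and that for all $n_1 \ge 1$
$$e_{m,f}(t) = \mathbb{E}[\mathrm{SMO}_{n_1, f}(t)],$$
we have, using (Eq. 22),
$$\begin{aligned}
\mathbb{E}[IJ] &= \mathbb{E}\left[\sum_{n_1^1, n_1^2 \ge 1} \mathbf{1}_{(T^{(n_1^1)} \le t)} \mathbf{1}_{(T^{(n_1^2)} \le t)}\, f(X_m(t - T^{(n_1^1)}))\, \mathbb{E}\bigl[\mathrm{SMO}_{n_1^2, f}(t - T^{(n_1^2)}) \mid \Pi^{(1)}\bigr]\right] \\
&= \mathbb{E}\left[\sum_{n_1^1, n_1^2 \ge 1} \mathbf{1}_{(T^{(n_1^1)} \le t)} \mathbf{1}_{(T^{(n_1^2)} \le t)}\, f(X_m(t - T^{(n_1^1)}))\, e_{m,f}(t - T^{(n_1^2)})\right].
\end{aligned}$$
As a result, according to Proposition A.1, we have
$$\begin{aligned}
\mathbb{E}[IJ] &= \mathbb{E}\left[\int_0^t f(X_m(t - u))\,P^{(1)}(du) \int_0^t e_{m,f}(t - u)\,P^{(1)}(du)\right] \\
&= \int_0^t \lambda_p(u_1) f(X_m(t - u_1))\,du_1 \int_0^t \lambda_p(u_2)\, e_{m,f}(t - u_2)\,du_2 \\
&\quad + \int_0^t \lambda_p(u) f(X_m(t - u))\, e_{m,f}(t - u)\,du.
\end{aligned}$$
The Term $\mathbb{E}[J^2]$: To compute this term we have to consider two cases:
$$\begin{aligned}
J^2 &= \sum_{n_1 \ge 1} \mathbf{1}_{(T^{(n_1)} \le t)}\, \mathrm{SMO}_{n_1, f}^2(t - T^{(n_1)}) \\
&\quad + \sum_{\substack{n_1^1, n_1^2 \ge 1 \\ n_1^1 \ne n_1^2}} \mathbf{1}_{(T^{(n_1^1)} \le t)} \mathbf{1}_{(T^{(n_1^2)} \le t)}\, \mathrm{SMO}_{n_1^1, f}(t - T^{(n_1^1)})\, \mathrm{SMO}_{n_1^2, f}(t - T^{(n_1^2)}) \\
&:= J_1 + J_2.
\end{aligned}$$
Following (Eq. 22), but with $\mathrm{SMO}_{n_1, f}^2$ instead, we have
$$\begin{aligned}
\mathbb{E}[J_1] &= \mathbb{E}\left[\sum_{n_1 \ge 1} \mathbf{1}_{(T^{(n_1)} \le t)}\, \mathbb{E}\bigl[\mathrm{SMO}_{n_1, f}^2(t - T^{(n_1)}) \mid \Pi^{(1)}\bigr]\right] \\
&= \mathbb{E}\left[\int_0^t e_{m,f}^{(2)}(t - u)\,P^{(1)}(du)\right] \\
&= \int_0^t \lambda_p(u)\, e_{m,f}^{(2)}(t - u)\,du,
\end{aligned}$$
where $e_{m,f}^{(2)}(t) := \mathbb{E}[\mathrm{SMO}_{m,f}^2(t)]$. Now using that $\mathrm{SMO}_{n_1^1, f}$ and
$\mathrm{SMO}_{n_1^2, f}$ are independent for $n_1^1 \ne n_1^2$, we have, using (Eq. 22)
and Proposition A.1 one more time,
$$\begin{aligned}
\mathbb{E}[J_2] &= \mathbb{E}\left[\sum_{n_1^1 \ne n_1^2} \mathbf{1}_{(T^{(n_1^1)} \le t)} \mathbf{1}_{(T^{(n_1^2)} \le t)}\, \mathbb{E}\bigl[\mathrm{SMO}_{n_1^1, f}(t - T^{(n_1^1)}) \mid \Pi^{(1)}\bigr]\, \mathbb{E}\bigl[\mathrm{SMO}_{n_1^2, f}(t - T^{(n_1^2)}) \mid \Pi^{(1)}\bigr]\right] \\
&= \mathbb{E}\left[\left(\sum_{n_1 \ge 1} \mathbf{1}_{(T^{(n_1)} \le t)}\, e_{m,f}(t - T^{(n_1)})\right)^2\right] - \mathbb{E}\left[\sum_{n_1 \ge 1} \mathbf{1}_{(T^{(n_1)} \le t)}\, e_{m,f}^2(t - T^{(n_1)})\right] \\
&= \mathbb{E}\left[\left(\int_0^t e_{m,f}(t - u)\,P^{(1)}(du)\right)^2\right] - \mathbb{E}\left[\int_0^t e_{m,f}^2(t - u)\,P^{(1)}(du)\right] \\
&= \left(\int_0^t \lambda_p(u)\, e_{m,f}(t - u)\,du\right)^2.
\end{aligned}$$
Combining the three previous computations, we obtain
$$e_f^{(2)} = \bigl(\lambda_p * (f(X_m) + e_{m,f})\bigr)^2 + \lambda_p * f^2(X_m) + 2\lambda_p * (f(X_m)\, e_{m,f}) + \lambda_p * e_{m,f}^{(2)}. \tag{24}$$
Considering this equation for $\lambda_p = \lambda_m$, we obtain, as for the expectation,
a renewal equation for $e_{m,f}^{(2)}$, which yields for all $t > 0$
$$e_{m,f}^{(2)}(t) \le C_1 + C_2 \int_0^t e_{m,f}^{(2)}(u)\,du,$$
and then for all $T > 0$,
$$\sup_{t \in [0, T]} e_{m,f}^{(2)}(t) \le C_{1,T} + C_{2,T} \sup_{t \in [0, T]} e_{m,f}^2 < +\infty,$$
using Gronwall’s inequality and Proposition 5.1. This proves that
$\mathbb{E}[\mathrm{SMO}_{m,f}^2(t)] < +\infty$ for all $t \ge 0$, and then $\mathbb{E}[\mathrm{SMO}_f^2(t)] < +\infty$ by
going back to (Eq. 24). Now, rewriting (Eq. 24), we obtain
$$e_f^{(2)} = e_f^2 + \lambda_p * (f(X_m) + e_{m,f})^2 + \lambda_p * v_{m,f},$$
which is (Eq. 23).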
A.3 Proof of Theorem 5.2
Using that $X_m(s) \in [x_m^0, x_m^1]$ for all $s \in \mathbb{R}_+$, the σ-finiteness and
absolute continuity of $\mu_t$ (for any $t \ge 0$) are direct consequences
of (Eq. 19). Denoting by $\tilde{\rho}_t$ its Radon–Nikodým density, Proposition 5.1
and Theorem 5.1 then yield
$$\int_{x_m^0}^{x_m^1} f(x)\,\mu_t(dx) = \int_{x_m^0}^{x_m^1} f(x)\,\tilde{\rho}_t(x)\,dx = \int_{x_m^0}^{x_m^1} f(x)\,\rho(t, x)\,dx,$$
for all $f \in C([x_m^0, x_m^1]) \cap L^1([x_m^0, x_m^1])$, which concludes the proof.
References
1. Çınlar E (2011) Probability and stochastics. Graduate texts in mathematics, vol 261. Springer, New York
2. Yu M, Bardia A, Wittner BS, Stott SL, Smas ME, Ting DT, Isakoff SJ, Ciciliano JC, Wells MN, Shah AM, Concannon KF, Donaldson MC, Sequist LV, Brachtel E, Sgroi D, Baselga J, Ramaswamy S, Toner M (2013) Circulating breast tumor cells exhibit dynamic changes in epithelial and mesenchymal composition. Science 339(6119):580–584
3. Nguyen DX, Bos PD, Massagué J (2009) Metastasis: from dissemination to organ-specific colonization. Nat Rev Cancer 9(4):274–284
4. Sahai E (2007) Illuminating the metastatic process. Nat Rev Cancer 7(10):737–749
5. WHO (2015) Cancer fact sheet. http://www.who.int/mediacentre/factsheets/fs297/en/. Accessed 14 Jan 2016
6. Pantel K, Cote RJ, Fodstad O (1999) Detection and clinical importance of micrometastatic disease. J Natl Cancer Inst 91(13):1113–1124
7. Scott JG, Gerlee P, Basanta D, Fletcher AG, Maini PK, Anderson ARA (2013) Mathematical modeling of the metastatic process. In: Malek A (ed) Experimental metastasis: modeling and analysis. Springer, Dordrecht, pp 189–208
8. Gupta GP, Massagué J (2006) Cancer metastasis: building a framework. Cell 127(4):679–695
9. Hanahan D, Weinberg RA (2011) Hallmarks of cancer: the next generation. Cell 144(5):646–674
10. Michor F, Nowak MA, Iwasa Y (2006) Stochastic dynamics of metastasis formation. J Theor Biol 240(4):521–530
11. Haeno H, Michor F (2010) The evolution of tumor metastases during clonal expansion. J Theor Biol 263(1):30–44
12. Anderson AR, Quaranta V (2008) Integrative mathematical oncology. Nat Rev Cancer 8(3):227–234
13. Koscielny S, Tubiana M, Lê MG, Valleron J, Mouriesse H, Contesso G, Sarrazin D (1984) Breast cancer: relationship between the size of the primary tumour and the probability of metastatic dissemination. Br J Cancer 49(6):709–715
14. Michaelson JS, Silverstein M, Wyatt J, Weber G, Moore R, Halpern E, Kopans DB, Hughes K (2002) Predicting the survival of patients with breast carcinoma using tumor size. Cancer 95(4):713–723
15. van de Vijver MJ, He YD, van’t Veer LJ, Dai H, Hart AAM, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers ET, Friend SH, Bernards R (2002) A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347(25):1999–2009
16. Hahnfeldt P, Panigrahy D, Folkman J, Hlatky L (1999) Tumor development under angiogenic signaling: a dynamical theory of tumor growth, response and postvascular dormancy. Cancer Res 59:4770–4775
17. Norton L (1988) A Gompertzian model of human breast cancer growth. Cancer Res 48:7067–7071
18. Verga F (2010) Modélisation mathématique de processus métastatiques. Ph.D. thesis, Aix-Marseille Université
19. Hart D, Shochat E, Agur Z (1998) The growth law of primary breast cancer as inferred from mammography screening trials data. Br J Cancer 78:382–387
20. Benzekry S, Lamont C, Beheshti A, Tracz A, Ebos JML, Hlatky L, Hahnfeldt P (2014) Classical mathematical models for description and prediction of experimental tumor growth. PLoS Comput Biol 10(8):e1003800
21. Bartoszyński R, Edler L, Hanin L, Kopp-Schneider A, Pavlova L, Tsodikov A, Zorin A, Yakovlev A (2001) Modeling cancer detection: tumor size as a source of information on unobservable stages of carcinogenesis. Math Biosci 171:113–142
22. Hanin L, Rose J, Zaider M (2006) A stochastic model for the sizes of detectable metastases. J Theor Biol 243:407–417
23. Iwata K, Kawasaki K, Shigesada N (2000) A dynamical model for the growth and size distribution of multiple metastatic tumors. J Theor Biol 203:177–186
24. Hartung N, Mollard S, Barbolosi D, Benabdallah A, Chapuisat G, Henry G, Giacometti S, Iliadis A, Ciccolini J, Faivre C, Hubert F (2014) Mathematical modeling of tumor growth and metastatic spreading: validation in tumor-bearing mice. Cancer Res 74:6397–6407
25. Benzekry S, Tracz A, Mastri M, Corbelli R, Barbolosi D, Ebos JML (2016) Modeling spontaneous metastasis following surgery: an in vivo-in silico approach. Cancer Res 76(3):535–547
26. Chaffer CL, Weinberg RA (2011) A perspective on cancer cell metastasis. Science 331(6024):1559–1564
27. Newton PK, Mason J, Bethel K, Bazhenova LA, Nieva J, Kuhn P (2012) A stochastic Markov chain model to describe lung cancer growth and metastasis. PLoS One 7(4):e34637
28. Newton PK, Mason J, Bethel K, Bazhenova L, Nieva J, Norton L, Kuhn P (2013) Spreaders and sponges define metastasis in lung cancer: a Markov chain Monte Carlo mathematical model. Cancer Res 73(9):2760–2769
29. Comen E, Norton L, Massague J (2011) Clinical implications of cancer self-seeding. Nat Rev Clin Oncol 8(6):369–377
30. Scott JG, Basanta D, Anderson AR, Gerlee P (2013) A mathematical model of tumour self-seeding reveals secondary metastatic deposits as drivers of primary tumour growth. J R Soc Interface 10(82):20130011
31. Hanin L, Zaider M (2011) Effects of surgery and chemotherapy on metastatic progression of prostate cancer: evidence from the natural history of the disease reconstructed through mathematical modeling. Cancers 3(3):3632–3660
32. Wheldon TE (1988) Mathematical models in cancer research. Medical science series. Adam Hilger, Bristol/Philadelphia
33. Benzekry S, Gandolfi A, Hahnfeldt P (2014) Global dormancy of metastases due to systemic inhibition of angiogenesis. PLoS One 9(1):e84249
34. Bethge A, Schumacher U, Wedemann G (2015) Simulation of metastatic progression using a computer model including chemotherapy and radiation therapy. J Biomed Inform 57:74–87
35. Lewis PAW, Shedler GS (1979) Simulation of nonhomogeneous Poisson processes by thinning. Nav Res Log Q 26(3):403
36. Sadahiro S, Suzuki T, Ishikawa K, Nakamura T, Tanaka Y, Masuda T, Mukoyama S, Yasuda S, Tajima T, Makuuchi H, Murayama C (2003) Recurrence patterns after curative resection of colorectal cancer in patients followed for a minimum of ten years. Hepatogastroenterology 50(53):1362–1366
37. Siegel R, DeSantis C, Virgo K, Stein K, Mariotto A, Smith T, Cooper D, Gansler T, Lerro C, Fedewa S, Lin C, Leach C, Cannady RS, Cho H, Scoppa S, Hachey M, Kirch R, Jemal A, Ward E (2012) Cancer treatment and survivorship statistics, 2012. CA Cancer J Clin 62(4):220–241
38. Batchelor GK (1967) An introduction to fluid dynamics. Cambridge University Press, Cambridge
39. Barbolosi D, Benabdallah B, Hubert F, Verga F (2009) Mathematical and numerical analysis for a model of growing metastatic tumors. Math Biosci 218:1–14
40. Hartung N (2015) Efficient resolution of metastatic tumour growth models by reformulation into integral equations. Discrete Contin Dyn Syst B 20:445–467
41. Lavielle M (2014) Mixed effects models for the population approach: models, tasks, methods and tools. Chapman & Hall/CRC biostatistics series. Chapman & Hall/CRC, Boca Raton
42. Tornøe CW, Overgaard RV, Agersø H, Nielsen HA, Madsen H, Jonsson EN (2005) Stochastic differential equations in NONMEM: implementation, application, and comparison with ordinary differential equations. Pharm Res 22(8):1247–1258
43. Bulfoni M, Gerratana L, Del Ben F, Marzinotto S, Sorrentino M, Turetta M, Scoles G, Toffoletto B, Isola M, Beltrami CA, Di Loreto C, Beltrami AP, Puglisi F, Cesselli D (2016) In patients with metastatic breast cancer the identification of circulating tumor cells in epithelial-to-mesenchymal transition is associated with a poor prognosis. Breast Cancer Res 18(1):30
44. Paoletti C, Hayes DF (2016) Circulating tumor cells. Adv Exp Med Biol 882:235–258
45. Chen LL, Blumm N, Christakis NA, Barabasi AL, Deisboeck TA (2009) Cancer metastasis networks and the prediction of progression patterns. Br J Cancer 101(5):749–758
Chapter 11
Mechanically Coupled Reaction-Diffusion Model to Predict Glioma Growth: Methodological Details
David A. Hormuth II, Stephanie L. Eldridge, Jared A. Weis, Michael I. Miga,
and Thomas E. Yankeelov
Abstract
Biophysical models designed to predict the growth and response of tumors to treatment have the potential
to become a valuable tool for clinicians in the care of cancer patients. Specifically, individualized tumor forecasts
could be used to predict response or resistance early in the course of treatment, thereby providing an
opportunity for treatment selection or adaptation. This chapter discusses an experimental and modeling
framework in which noninvasive imaging data is used to initialize and parameterize a subject-specific model
of tumor growth. This modeling approach is applied to an analysis of murine models of glioma growth.
Key words Cancer, Biophysical stress, Diffusion, Invasion, MRI, Finite difference method
1 Introduction
Biophysical models of tumor growth and treatment response have

the potential to fundamentally change the clinical care for cancer
patients by providing clinicians with accurate and precise patient-
specific predictive models. Through the use of noninvasive imaging
data, these biophysical models can be parameterized by the unique
characteristics of an individual’s tumor to provide a “forecast” of
future tumor growth and treatment response [1]. We [2–6] and
others [7–11] have begun investigating the development of
patient-specific mathematical models of cancer. In this work, we
provide a detailed guide to the implementation of a mechanically
coupled reaction-diffusion model [4, 6, 12] applied to glioma
growth in rats.
The standard reaction-diffusion equation, Eq. 1, is commonly
used to model glioma growth [5, 7] and describes the spatial-
temporal evolution of tumor cell number due to the random move-
ment of tumor cells (diffusion, first term on the right-hand side)
and the proliferation of cells (reaction, second term on the right-hand side):
$$\frac{\partial N(x, y, z, t)}{\partial t} = \nabla \cdot \bigl(D(x, y, z)\, \nabla N(x, y, z, t)\bigr) + k(x, y, z)\, N(x, y, z, t) \left(1 - \frac{N(x, y, z, t)}{\theta}\right), \tag{1}$$
where N(x, y, z, t) is the number of tumor cells at the three-
dimensional position (x, y, z) and time t, D(x, y, z) is the tumor
cell diffusion coefficient, k(x, y, z) is the net tumor cell prolifera-
tion, and θ is the tumor cell carrying capacity. One important
limitation of the standard reaction-diffusion equation is that
tumor growth is only restricted by the boundaries of the simulation
domain (i.e., the skull for gliomas). In reality, as the tumor expands
it interacts with the healthy brain tissue causing increased mechanical
stress and the displacement of surrounding tissue, a phenomenon
termed the “mass effect” [13] and observed in several types of brain
tumors [14]. The increased stress experienced by the tumor can
impede further growth as demonstrated in the seminal work by
Helmlinger et al. [15]. In Helmlinger et al.’s [15] contribution
multi-cellular spheroids were grown in agar gel concentrations
ranging from 0% to 1%. Increasing the agar concentration resulted
in inhibited expansion of the spheroid as the substrate stiffness
increased. More specifically, similar spheroid interactions with the
surrounding environment would require increased force at elevated
levels of stiffness. This phenomenon can also result in the preferen-
tial growth of tumors in areas of increasing mechanical compliance.
To incorporate this effect, we first describe the mechanical equilibrium, Eq. 2:
$$\nabla \cdot \sigma - \lambda_f \nabla N = 0, \tag{2}$$
where σ is the stress tensor and $\lambda_f$ is the tumor cell-force coupling
constant. For implementation, Eq. 2 is rewritten in terms of the
tissue displacement ($\vec{u}$) under a linear elastic isotropic material
assumption in Eq. 3:
$$\nabla \cdot G \nabla \vec{u} + \nabla \frac{G}{1 - 2\nu} (\nabla \cdot \vec{u}) - \lambda_f \nabla N = 0, \tag{3}$$
where G is the shear modulus (a material property that represents
the constant of proportionality between shear stress and shear strain)
and ν is Poisson’s ratio (a material property that is a ratio relating
lateral to longitudinal strain). The first two terms on the left-hand
side in Eq. 3 represent the linear-elastic description of tissue displacement,
while the third term represents a local body force generated
by the invading tumor. $\vec{u}$ is then used to calculate the local
normal (εxx, εyy, εzz) and shear strains (εxy, εxz, εyz). For small
deformations, strain εi,j is defined as the total deformation in the
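direction i divided by the original length in direction j and is
calculated using Eq. 4:
$$\begin{bmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \\ \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{bmatrix} = \begin{bmatrix} \partial u/\partial x \\ \partial v/\partial y \\ \partial w/\partial z \\ \partial u/\partial y \\ \partial u/\partial z \\ \partial v/\partial z \end{bmatrix}, \tag{4}$$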
where u, v, and w represent the deformation in the x-, y-, and z-
directions, respectively. The normal and shear strains are then used
to calculate the normal and shear stresses using Hooke’s law, Eq. 5:
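$$\begin{bmatrix} \sigma_{xx} \\ \sigma_{yy} \\ \sigma_{zz} \\ \sigma_{xy} \\ \sigma_{xz} \\ \sigma_{yz} \end{bmatrix} = \frac{2G}{1 - 2\nu} \begin{bmatrix} 1-\nu & \nu & \nu & 0 & 0 & 0 \\ \nu & 1-\nu & \nu & 0 & 0 & 0 \\ \nu & \nu & 1-\nu & 0 & 0 & 0 \\ 0 & 0 & 0 & (1-2\nu) & 0 & 0 \\ 0 & 0 & 0 & 0 & (1-2\nu) & 0 \\ 0 & 0 & 0 & 0 & 0 & (1-2\nu) \end{bmatrix} \begin{bmatrix} \varepsilon_{xx} \\ \varepsilon_{yy} \\ \varepsilon_{zz} \\ \varepsilon_{xy} \\ \varepsilon_{xz} \\ \varepsilon_{yz} \end{bmatrix}. \tag{5}$$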
The normal and shear stresses for a given voxel are then
incorporated into a single term called the Von Mises stress, $\sigma_{vm}(x, y, z, t)$,
in Eq. 6:
$$\sigma_{vm}(x, y, z, t) = \left[\frac{1}{2}\left((\sigma_{xx} - \sigma_{yy})^2 + (\sigma_{xx} - \sigma_{zz})^2 + (\sigma_{zz} - \sigma_{yy})^2 + 6\left(\sigma_{xy}^2 + \sigma_{xz}^2 + \sigma_{yz}^2\right)\right)\right]^{1/2}, \tag{6}$$
with each stress component evaluated at (x, y, z, t).
The Von Mises stress is a term that reflects the total experienced
stress for a given section of tissue, and is often used within failure
criterion strategies in materials. We use the Von Mises stress to
reflect the interaction between the growing tumor and its environ-
ment, that is, in our approach we use the Von Mises stress to
spatially and temporally restrict tumor cell diffusion [4, 6, 12]
using Eq. 7:
$$D(x, y, z, t) = D_0\, e^{-\lambda_D \sigma_{vm}(x, y, z, t)}, \tag{7}$$
where D0 represents the diffusion coefficient of tumor cells in the

absence of mechanical restrictions and λD is a stress-tumor cell
diffusion coupling constant.
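For a single voxel, the chain from strains to a stress-damped diffusion coefficient (Eqs. 5 through 7) can be sketched as follows. This is an illustrative sketch rather than the authors' Matlab implementation; the function names and the numerical values of the strains, G, D0, and λD are hypothetical.

```python
import math

def hooke_stress(strain, G, nu):
    """Eq. 5: (exx, eyy, ezz, exy, exz, eyz) -> stresses in the same order."""
    exx, eyy, ezz, exy, exz, eyz = strain
    c = 2.0 * G / (1.0 - 2.0 * nu)
    return (
        c * ((1 - nu) * exx + nu * eyy + nu * ezz),
        c * (nu * exx + (1 - nu) * eyy + nu * ezz),
        c * (nu * exx + nu * eyy + (1 - nu) * ezz),
        c * (1 - 2 * nu) * exy,
        c * (1 - 2 * nu) * exz,
        c * (1 - 2 * nu) * eyz,
    )

def von_mises(stress):
    """Eq. 6: scalar Von Mises stress from the six stress components."""
    sxx, syy, szz, sxy, sxz, syz = stress
    normal = (sxx - syy) ** 2 + (sxx - szz) ** 2 + (szz - syy) ** 2
    shear = 6.0 * (sxy ** 2 + sxz ** 2 + syz ** 2)
    return math.sqrt(0.5 * (normal + shear))

def damped_diffusion(D0, lambda_D, sigma_vm):
    """Eq. 7: diffusion coefficient reduced exponentially by stress."""
    return D0 * math.exp(-lambda_D * sigma_vm)

# Hypothetical voxel: small strains, made-up material and coupling constants.
strain = (0.01, -0.002, -0.002, 0.001, 0.0, 0.0)
stress = hooke_stress(strain, G=1.0, nu=0.45)
D = damped_diffusion(D0=0.05, lambda_D=2.0, sigma_vm=von_mises(stress))
```

Two quick sanity checks follow from Eq. 6: a purely uniaxial stress s gives a Von Mises stress of |s|, and a purely hydrostatic stress (equal normal components, no shear) gives zero, so uniform compression alone would not reduce D in this formulation.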
In this chapter, we will discuss how to implement this model
system using the finite difference method as well as how to individ-
ualize this model using an individual patient’s imaging data. Non-
invasive imaging measurements from diffusion-weighted magnetic
resonance imaging (DW-MRI [16]) and contrast enhanced MRI
(CE-MRI, [17]) are used to estimate the spatial distribution of
tumor cell number in a murine model of glioma at several experi-

mental time points. The in vivo estimated cell number then pro-
vides the initial tumor cell distribution and is also used to solve an
inverse problem to return estimates of the model parameters. The
estimated model parameters can then be used to simulate future
tumor growth.
2 Materials
2.1 Dataset
The numerical methods presented in this chapter use an in vivo
dataset acquired in rats with intracranially inoculated glioma cells
[5, 18, 19]. Alternatively, an in silico dataset can also be used
[5]. For both approaches the dataset should contain:
1. Three-dimensional estimates of the distribution of tumor cells
at several time points.
2. Three-dimensional map of k (or initial guess).
3. Single value for D0 (or initial guess).
4. Values for G, ν, λD, λf, and θ (based on literature, calculation, or
assignment, see Note 1).
For use in Matlab this dataset should be saved as a “.mat” file
consisting of a 4D array of tissue cellularity, a 3D array of k values,
and one-element arrays of D0, G, ν, λD, λf, and θ all with double
precision.
2.2 Software/Hardware Requirements
The forward evaluation and parameter optimization of the mechanically
coupled model were run on a Dell PowerEdge R820 server
consisting of four Intel Xeon E5-4610 2.3 GHz processors with a
total of 256 GB of memory using Matlab 2015b. The forward
evaluation is relatively less computationally intensive and takes less
than 16 s for a 10-day simulation on a laptop with 8 GB of memory
and an Intel i5-2550 M 2.5 GHz processor. The parameter optimization
computation time, however, depends on both the number of
parameters being estimated and the number of iterations of the
optimization algorithm until stopping criteria are met. Parallelization
of the parameter perturbation code can reduce computation
time by a factor approximately equal to the number of parallel
threads. (For example, parameter perturbation for 100 parameters
takes 13.1 min with 1 thread, 3.1 min with 4 threads, 1.7 min with
8 threads, 0.9 min with 16 threads, and 0.7 min with 32 threads.)
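The quoted timings can be turned into speedup and parallel-efficiency figures with a few lines of arithmetic. The snippet below uses only the numbers given above; the variable names are illustrative.

```python
# Wall-clock times (minutes) for perturbing 100 parameters, as quoted above.
timings_min = {1: 13.1, 4: 3.1, 8: 1.7, 16: 0.9, 32: 0.7}

base = timings_min[1]
speedup = {threads: base / minutes for threads, minutes in timings_min.items()}
efficiency = {threads: speedup[threads] / threads for threads in timings_min}

for threads in timings_min:
    print(f"{threads:2d} threads: speedup {speedup[threads]:4.1f}x, "
          f"efficiency {efficiency[threads]:.2f}")
```

With these numbers the speedup is roughly 4.2x, 7.7x, and 14.6x for 4, 8, and 16 threads, which is close to linear, but only about 18.7x for 32 threads, so the gain from additional threads tapers off at the highest thread count reported.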
3 Methods
3.1 Animal Experiments
While details are presented in [5], we here discuss the salient
features of the experimental procedure (see Fig. 1).

Fig. 1 Experimental timeline and estimation of in vivo cell number from DW-MRI data. (a) On day 0, rats are
injected intracranially with $10^5$ C6 glioma cells. (b) Jugular catheters are then inserted on day 8. (c) On days
10 through 20, rats are imaged with MRI with 3D gradient echo, DW-MRI, and CE-MRI. (d) CE-MRI is used to
identify tumor tissue by subtracting the pre-contrast image from the post-contrast image. (e) ADC(x, y, z, t) is
then estimated from DW-MRI data. (f) Finally, N(x, y, z, t) is estimated within the tumor tissue using Eq. 9 and
ADC(x, y, z, t)

The in vivo dataset described in this section was acquired in female Wistar rats
inoculated intracranially with C6 glioma cells ($1 \times 10^5$) via stereotaxic
injection on day 0 (Fig. 1a). On day 8, permanent jugular
catheters were placed in each rat (Fig. 1b). Beginning on day
10, rats were imaged (Fig. 1c) with a 3D gradient echo, DW-MRI,
and CE-MRI (see Note 2 for remarks on the experiment timeline
and measurement frequency). The 3D gradient echo data were
collected with a larger field of view (45 mm × 45 mm × 45 mm)
and larger sampling matrix (256 × 256 × 128) for image registration
purposes. The DW-MRI and CE-MRI data were acquired with
a 32 mm × 32 mm × 16 mm field of view and a 128 × 128 × 16
sampling matrix. During the CE-MRI acquisition, a 200 μL bolus
(0.05 mmol/kg) of gadolinium-diethylenetriamine pentaacetic
acid, an MRI contrast agent, is injected to identify tumor regions
(Fig. 1d). Areas of signal enhancement in the post-contrast

CE-MRI data were used to identify tumor regions of interest
(ROI). Tumor cellularity (N(x, y, z, t)) was estimated from
DW-MRI. Briefly, DW-MRI is an imaging method that is sensitive
to the diffusion of water within tissue, and several groups have
observed an inverse relationship between the apparent diffusion
coefficient (ADC) and cellularity [20–24]. The ADC is estimated
voxel-wise from DW-MRI (Fig. 1e) data acquired at several b-
values by fitting Eq. 8 to the acquired signal at each b-value:
$$S(x, y, z, b) = S_0(x, y, z)\, e^{-b \cdot \mathrm{ADC}(x, y, z)}, \tag{8}$$
where S(x, y, z, b) is the acquired signal at three-dimensional posi-
tion (x,y,z) and b-value b, S0(x, y, z) is the intrinsic signal, and ADC
(x, y, z) is the apparent diffusion coefficient. The tumor ROI iden-
tified from CE-MRI is then applied as a mask to ADC(x, y, z)
(Fig. 1f), to estimate cellularity only within the tumor using Eq. 9:

$$N(x, y, z) = \theta \left(\frac{\mathrm{ADC}_w - \mathrm{ADC}(x, y, z)}{\mathrm{ADC}_w - \mathrm{ADC}_{min}}\right), \tag{9}$$
where θ is the maximum tumor cell carrying capacity, $\mathrm{ADC}_w$ is the
ADC of water at 37 °C ($2.5 \times 10^{-3}$ mm²/s) [25], ADC(x, y, z) is
the ADC value at position (x, y, z), and $\mathrm{ADC}_{min}$ is the minimum
ADC value which corresponds to the voxel with the largest number
of cells [2]. θ can be calculated using the imaging voxel dimensions
(0.25 mm × 0.25 mm × 1.00 mm), and assuming spherical tumor
cells with a packing density of 0.7405 [26] and an average cell
volume of 908 μm³ [27] (see Note 3 for further remarks on packing
density and cell volume).
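The steps from DW-MRI signal to cell number for one voxel (Eqs. 8 and 9) can be sketched as below. The chapter does not state how Eq. 8 is fitted, so the log-linear least-squares fit is an assumption, as are the function names, the b-values, the synthetic signals, and the value used for the minimum ADC.

```python
import math

def fit_adc(b_values, signals):
    """Least-squares fit of ln S = ln S0 - b * ADC (Eq. 8). Returns (S0, ADC)."""
    n = len(b_values)
    logs = [math.log(s) for s in signals]
    mean_b = sum(b_values) / n
    mean_y = sum(logs) / n
    sbb = sum((b - mean_b) ** 2 for b in b_values)
    sby = sum((b - mean_b) * (y - mean_y) for b, y in zip(b_values, logs))
    slope = sby / sbb
    return math.exp(mean_y - slope * mean_b), -slope

def carrying_capacity(voxel_mm, packing=0.7405, cell_volume_um3=908.0):
    """Cells per voxel for packed spherical cells (1 mm^3 = 1e9 um^3)."""
    vx, vy, vz = voxel_mm
    return packing * (vx * vy * vz * 1e9) / cell_volume_um3

def cell_number(adc, theta, adc_min, adc_w=2.5e-3):
    """Eq. 9: linear map from ADC (mm^2/s) to number of cells in the voxel."""
    return theta * (adc_w - adc) / (adc_w - adc_min)

# Synthetic single-voxel example (made-up b-values in s/mm^2 and signal).
b_values = [150.0, 500.0, 800.0]
signals = [1000.0 * math.exp(-b * 1.0e-3) for b in b_values]
s0, adc = fit_adc(b_values, signals)

theta = carrying_capacity((0.25, 0.25, 1.00))
n_cells = cell_number(adc, theta, adc_min=0.6e-3)   # adc_min is hypothetical
```

With the voxel size, packing density, and cell volume quoted above, `carrying_capacity` gives θ of about 5.1 × 10⁴ cells per voxel. In practice the minimum ADC would be taken from the measured tumor ROI rather than fixed by hand, and voxels whose ADC exceeds that of water would need to be handled before applying Eq. 9.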
A voxel-wise k and a global D0 are estimated from serial mea-
surements of N(x, y, z, t) in a parameter optimization procedure
[5]. G is assigned from literature values to anatomical regions
identified in imaging data (e.g., cortex, corpus callosum, hippo-
campus, thalamus, putamen) [28, 29], while ν is set to 0.45 (as we
assume that tissue is nearly incompressible). λD can be assigned or a
range of values can be evaluated to apply different degrees of
mechanical coupling, while λf is set to 1.
3.2 Modeling
We now discuss the details of the finite difference simulation for
Eqs. 1 and 2, the forward evaluation of the model system, and the
parameter optimization and the tumor growth prediction
approach. Figure 2 shows an overview of the data collection,
parameter optimization, and prediction approach. Briefly, data are
acquired from $t_i$ to $t_f$. A subset of the total data (days $t_i$ to $t_n$, where
$t_n$ is less than $t_f$) is first used to determine the optimal model
parameters. Once the stopping criteria are met for the parameter
optimization approach, the optimized model parameters are then
used in a forward evaluation of the model to simulate future tumor
growth. The measured data are then compared to the model-predicted
growth on days $t_{n+1}$ to $t_f$. With respect to the clinical context,
$t_n$ would represent the time point at which early-course-of-therapy
data could be collected and the model calibrated to the patient. Once
complete, assessments of the efficacy of therapy would be forecast in
silico for the future time point $t_f$ and could lead to changes to the therapy
regimen or to alternate therapies.

Fig. 2 Tumor growth modeling and prediction flow chart. DW-MRI and CE-MRI data is first acquired in rats at
days $t_i$ to $t_f$. A subset of the total data ($t_i$ to $t_n$) is used to first estimate model parameters using an iterative
optimization algorithm. The optimized model parameters are then used in a forward evaluation of the model
system to predict tumor growth at the remaining data points ($t_{n+1}$ to $t_f$). The error is then assessed between
the model and measured values of N(x, y, z, t)
3.2.1 Finite Difference Simulation Setup
As an illustrative example for clarity, we show the derivation of the
finite difference model for a 1D implementation, followed by
extending the model to the full 3D implementation. A Taylor series
expansion is used to derive the finite difference approximation of
the tumor cell model (Eq. 1) as shown for the 1D implementation
in Eq. 10:
$$\frac{N(x, t + h_t) - N(x, t)}{h_t} = \frac{\delta N(x, t)}{2h_x} \frac{\delta D(x)}{2h_x} + D(x) \left(\frac{\delta^2 N(x, t)}{h_x^2}\right) + k(x)\, N(x, t) \left(1 - \frac{N(x, t)}{\theta}\right), \tag{10}$$
θ
where ht is the time step, and hx is the grid spacing in the x-
direction, and δ represents the central difference operator, defined
below in Eqs. 11 and 12. Finite difference approximations are
derived using a full grid approach to take advantage of the natural,
voxelized gridding from the experimental imaging data
measurements. The central difference approximation of the first

derivative in (for example) the x-direction is shown in Eq. 11:
∂N ðx; t Þ δN ðx; t Þ N ðx þ h x ; t Þ N ðx h x ; t Þ
¼ : ð11Þ
∂x 2h x 2h x
Similarly, the central difference approximation of the second
derivative in (for example) the x-direction is shown in Eq. 12:
∂2 N ðx; t Þ δ2 N ðx; t Þ

∂x 2 h 2x
N ðx þ h x ; t Þ 2 N ðx; t Þ þ N ðx h x ; t Þ
¼ : ð12Þ
h 2x
In the case of a mesh boundary, where the node at either (x + 1)
or (x 1) does not exist, the zero flux boundary condition (∂N/
∂x ¼ 0) can be used to relateN(x + hx, t) to N(x hx, t) (or vice
versa) as shown in Eq. 13:
N ðx þ h x ; t Þ N ðx h x ; t Þ
¼ 0 ) N ðx þ h x ; t Þ
2h x
¼ N ðx h x ; t Þ: ð13Þ
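As a minimal sketch of how Eqs. 10–13 might be combined into a single explicit update, the following NumPy function advances a 1D cell-number profile by one time step. The function name `step_1d`, the mirroring of D at the boundary, and the units (mm and days) are assumptions made for illustration; the stability check is the 1D reduction of the bound given in Note 5.

```python
import numpy as np

def step_1d(N, D, k, theta, h_x, h_t):
    """Advance N(x, t) by one explicit time step following Eq. 10 (1D sketch)."""
    # 1D reduction of the stability bound in Note 5: D * h_t / h_x^2 < 1/2.
    assert np.max(D) * h_t / h_x**2 < 0.5, "time step too large for stability"

    # Zero-flux boundary (Eq. 13): the missing neighbor mirrors the interior one.
    # Mirroring D in the same way is an assumption of this sketch.
    N_pad = np.concatenate(([N[1]], N, [N[-2]]))
    D_pad = np.concatenate(([D[1]], D, [D[-2]]))

    dN = (N_pad[2:] - N_pad[:-2]) / (2.0 * h_x)         # Eq. 11 applied to N
    dD = (D_pad[2:] - D_pad[:-2]) / (2.0 * h_x)         # Eq. 11 applied to D
    d2N = (N_pad[2:] - 2.0 * N + N_pad[:-2]) / h_x**2   # Eq. 12

    growth = k * N * (1.0 - N / theta)                  # logistic growth term
    return N + h_t * (dN * dD + D * d2N + growth)
```

Calling `step_1d` repeatedly with a fixed `h_t` reproduces the fully explicit time stepping described in this section; a spatially uniform profile with k = 0 is left unchanged, which is a quick sanity check on the boundary handling.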
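The 3D implementation of Eq. 1 is shown below in Eq. 14:

\[
\begin{aligned}
\frac{N(x, y, z, t + h_t) - N(x, y, z, t)}{h_t}
={}& \left(\frac{\delta N(x, y, z, t)}{2h_x}\,\frac{\delta D(x, y, z)}{2h_x}
+ D(x, y, z)\,\frac{\delta^2 N(x, y, z, t)}{h_x^2}\right) \\
&+ \left(\frac{\delta N(x, y, z, t)}{2h_y}\,\frac{\delta D(x, y, z)}{2h_y}
+ D(x, y, z)\,\frac{\delta^2 N(x, y, z, t)}{h_y^2}\right) \\
&+ \left(\frac{\delta N(x, y, z, t)}{2h_z}\,\frac{\delta D(x, y, z)}{2h_z}
+ D(x, y, z)\,\frac{\delta^2 N(x, y, z, t)}{h_z^2}\right) \\
&+ k(x, y, z)\,N(x, y, z, t)\left(1 - \frac{N(x, y, z, t)}{\theta}\right).
\end{aligned}
\tag{14}
\]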
The derivation of the finite difference approximation of Eq. 2 is shown for the 1D implementation in Eqs. 15–17. Equation 2 is first rewritten in terms of the 1D stress in the x-direction ($\sigma_x$) in Eq. 15:

\[
\nabla \cdot \sigma_x(x) - \lambda_f \nabla N(x, t) = 0.
\tag{15}
\]

$\sigma_x$ is then replaced with Hooke's law for a linear elastic isotropic material ($\sigma_x = E\,\varepsilon_x$) in Eq. 16:

\[
\nabla \cdot \left(E\,\varepsilon_x(x)\right) = \lambda_f \nabla N(x, t),
\tag{16}
\]

where E is Young's modulus and $\varepsilon_x$ is equal to ∂u/∂x. The divergence is then evaluated and the finite difference approximations are applied in Eq. 17:

\[
\frac{\delta E(x)}{2h_x}\,\frac{\delta u(x)}{2h_x}
+ E(x)\left(\frac{\delta^2 u(x)}{h_x^2}\right)
= \lambda_f\,\frac{\delta N(x, t)}{2h_x}.
\tag{17}
\]
A similar approach as shown in Eqs. 15–17 can be followed to
obtain the full 3D implementation of Eq. 2. Equations 18–20 show
the finite difference approximation for the 3D implementation of
Eq. 2. Equation 18 shows the x-direction component of Eq. 2:

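\[
\begin{aligned}
&\frac{2(1 - \nu)}{1 - 2\nu}\left(\frac{\delta G}{2h_x}\frac{\delta u}{2h_x} + G\,\frac{\delta^2 u}{h_x^2}\right)
+ \frac{2\nu}{1 - 2\nu}\left(\frac{\delta G}{2h_x}\frac{\delta v}{2h_y} + G\,\frac{\delta}{2h_x}\frac{\delta v}{2h_y}\right) \\
&\quad + \frac{2\nu}{1 - 2\nu}\left(\frac{\delta G}{2h_x}\frac{\delta w}{2h_z} + G\,\frac{\delta}{2h_x}\frac{\delta w}{2h_z}\right)
+ 2\left(\frac{\delta G}{2h_x}\frac{\delta u}{2h_y} + G\,\frac{\delta}{2h_x}\frac{\delta u}{2h_y}\right) \\
&\quad + 2\left(\frac{\delta G}{2h_x}\frac{\delta u}{2h_z} + G\,\frac{\delta}{2h_x}\frac{\delta u}{2h_z}\right)
= \lambda_f\,\frac{\delta N}{2h_x},
\end{aligned}
\tag{18}
\]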
where u, v, and w represent tissue displacement in the x-, y-, and z-
directions, respectively. Eq. 19 shows the y-direction component of
Eq. 2:
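\[
\begin{aligned}
&\frac{2(1 - \nu)}{1 - 2\nu}\left(\frac{\delta G}{2h_y}\frac{\delta v}{2h_y} + G\,\frac{\delta^2 v}{h_y^2}\right)
+ \frac{2\nu}{1 - 2\nu}\left(\frac{\delta G}{2h_y}\frac{\delta u}{2h_x} + G\,\frac{\delta}{2h_y}\frac{\delta u}{2h_x}\right) \\
&\quad + \frac{2\nu}{1 - 2\nu}\left(\frac{\delta G}{2h_y}\frac{\delta w}{2h_z} + G\,\frac{\delta}{2h_y}\frac{\delta w}{2h_z}\right)
+ 2\left(\frac{\delta G}{2h_y}\frac{\delta v}{2h_x} + G\,\frac{\delta}{2h_y}\frac{\delta v}{2h_x}\right) \\
&\quad + 2\left(\frac{\delta G}{2h_y}\frac{\delta v}{2h_z} + G\,\frac{\delta}{2h_y}\frac{\delta v}{2h_z}\right)
= \lambda_f\,\frac{\delta N}{2h_y}.
\end{aligned}
\tag{19}
\]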
Equation 20 shows the z-direction component of Eq. 2:

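\[
\begin{aligned}
&\frac{2(1 - \nu)}{1 - 2\nu}\left(\frac{\delta G}{2h_z}\frac{\delta w}{2h_z} + G\,\frac{\delta^2 w}{h_z^2}\right)
+ \frac{2\nu}{1 - 2\nu}\left(\frac{\delta G}{2h_z}\frac{\delta u}{2h_x} + G\,\frac{\delta}{2h_z}\frac{\delta u}{2h_x}\right) \\
&\quad + \frac{2\nu}{1 - 2\nu}\left(\frac{\delta G}{2h_z}\frac{\delta v}{2h_y} + G\,\frac{\delta}{2h_z}\frac{\delta v}{2h_y}\right)
+ 2\left(\frac{\delta G}{2h_z}\frac{\delta w}{2h_x} + G\,\frac{\delta}{2h_z}\frac{\delta w}{2h_x}\right) \\
&\quad + 2\left(\frac{\delta G}{2h_z}\frac{\delta w}{2h_y} + G\,\frac{\delta}{2h_z}\frac{\delta w}{2h_y}\right)
= \lambda_f\,\frac{\delta N}{2h_z}.
\end{aligned}
\tag{20}
\]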
The unknown tissue displacements u, v, and w are solved for by rewriting Eqs. 18–20 as the matrix system shown in Eq. 21:

\[
[M]\{U\} = \lambda_f \{\nabla N\},
\tag{21}
\]

where [M] is a square 3n × 3n matrix of the finite difference coefficients and {U} is equal to $\{u_1, \ldots, u_n, v_1, \ldots, v_n, w_1, \ldots, w_n\}^T$, where $u_i$, $v_i$, and $w_i$ represent the displacement at the ith node in the x-, y-, and z-direction, respectively. {∇N} is equal to $\{\partial N_1/\partial x, \ldots, \partial N_n/\partial x, \partial N_1/\partial y, \ldots, \partial N_n/\partial y, \partial N_1/\partial z, \ldots, \partial N_n/\partial z\}^T$, where $\partial N_i/\partial x$, $\partial N_i/\partial y$, and $\partial N_i/\partial z$ represent the gradient at the ith node in the x-, y-, and z-direction, respectively. Rows 1 through n of [M] represent the coefficients for Eq. 18, rows n + 1 through 2n represent the coefficients for Eq. 19, and rows 2n + 1 through 3n represent the coefficients for Eq. 20. Rows 1 through n of {U} and {∇N} represent the x-direction components (u and ∂N/∂x, respectively), rows n + 1 through 2n represent the y-direction components (v and ∂N/∂y, respectively), and rows 2n + 1 through 3n represent the z-direction components (w and ∂N/∂z, respectively). [M] is built only once and can be factorized into lower and upper triangular matrices (refer to Note 4 for further details on the construction and solving of Eq. 21).

Equations 1 and 2 are solved using a finite difference simulation that is three-dimensional in space (grid spacing: 250 × 250 × 1000 μm) and fully explicit in time (for Eq. 1, time step = 0.01 days). (Refer to Note 5 for details on selecting an appropriate time step.) Equation 1 has no diffusive flux at the brain tissue boundaries (Neumann boundary condition [30]). Equation 2 has no tissue displacement in the Cartesian direction of the boundary (Dirichlet boundary condition), while displacement in the other Cartesian directions is unknown (slip condition [31]).
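Because [M] does not change between time steps, its factorization can be computed once and reused for every right-hand side. The sketch below illustrates that idea with SciPy's sparse LU routine on a small stand-in matrix; the matrix values and the helper name `solve_displacement` are hypothetical, and this direct solve is only one option (Note 4 describes the preconditioned iterative solver used in this work).

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# Toy stand-in for [M]: a small, nonsingular sparse matrix (hypothetical values).
n = 5
main = 4.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
M = csc_matrix(np.diag(main) + np.diag(off, 1) + np.diag(off, -1))

# Factorize once; the factors are reused at every simulation time step.
lu = splu(M)

def solve_displacement(grad_N, lambda_f):
    """Solve [M]{U} = lambda_f {grad N} (Eq. 21) using the stored LU factors."""
    return lu.solve(lambda_f * np.asarray(grad_N, dtype=float))
```

In a full simulation only the right-hand side changes from one time step to the next, so each call to `solve_displacement` costs a pair of triangular solves rather than a new factorization.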
3.2.2 Forward Evaluation

A summary and example of the forward evaluation algorithm is presented in Fig. 3. The forward evaluation begins with solving the mechanical model (steps 1 through 4 in Fig. 3). At the beginning of each iteration, the gradient of the current distribution of tumor cells, ∇N(x, y, z, t), is calculated and assigned to {∇N} (step 1 in Fig. 3). {U} is then solved for in Eq. 21 (step 2 in Fig. 3). The strains (Eq. 4) and stresses (Eqs. 5 and 6) are calculated (step 3 in Fig. 3). $\sigma_{vm}(x, y, z, t)$ is then used to update D(x, y, z, t) (Eq. 7, step 4 in Fig. 3). Finally, D(x, y, z, t) is used in the evaluation of Eq. 1 to determine N(x, y, z, t + 1) (step 5 in Fig. 3). The forward evaluation of the model system is then repeated at each simulation time step.
3.3 Parameter Optimization and Tumor Growth Prediction

The optimal model parameters are determined using an iterative Levenberg-Marquardt [32, 33] weighted least squares optimization:

\[
\left[J^T W J + \alpha\,D_{J^T W J}\right]\{\Delta\beta\}
= J^T W \left\{N_{\mathrm{meas}} - N_{\mathrm{model}}(\beta)\right\},
\tag{22}
\]

where J is the Jacobian matrix, W is a diagonal weighting matrix, α is a damping parameter, $D_{J^T W J}$ is a diagonal matrix consisting of the diagonal elements of $J^T W J$, {Δβ} is a vector of updates to the model parameters, {$N_{\mathrm{meas}}$} is a vector of the measured cell number, and {$N_{\mathrm{model}}(\beta)$} is a vector of the model-described cell number using the current best set of parameters β. J is an (n × $n_t$) by p matrix, where n is the number of voxels, $n_t$ is the number of time points, and p is the number of model parameters; W is an (n × $n_t$) × (n × $n_t$) matrix; {Δβ} has p components; and {$N_{\mathrm{meas}}$} has (n × $n_t$) components. J can be estimated using numerical differentiation (refer to Note 6 for further comments on J). For example, the J element at row i and column j, Eq. 23, represents the partial derivative of the model cell number at node i with respect to the jth model parameter and is calculated by individually perturbing model parameters as described below:

\[
J_{i,j} = \frac{\partial N_i}{\partial \beta_j}
= \frac{N_{\mathrm{model}}(i, \beta_{\mathrm{alt}}) - N_{\mathrm{model}}(i, \beta)}{\beta_{\mathrm{alt},j} - \beta_j},
\tag{23}
\]

Fig. 3 Algorithm and example forward evaluation of the mechanical and tumor cell model. The mechanical model is first solved to calculate the tissue displacement vector {U} due to N(x, y, z, t), Eq. 21. {U} is then used to calculate strain, stress, and $\sigma_{vm}(x, y, z, t)$. The new value of D(x, y, z, t) is calculated using Eq. 7 and $\sigma_{vm}(x, y, z, t)$. Finally, D(x, y, z, t) is used in Eq. 1 to calculate the value of N(x, y, z, t + 1)
where $N_{\mathrm{model}}(i, \beta_{\mathrm{alt}})$ is the model cell number at the ith index of {$N_{\mathrm{model}}$} using parameters $\beta_{\mathrm{alt}}$, and $N_{\mathrm{model}}(i, \beta)$ is the model cell number at the ith index of {$N_{\mathrm{model}}$} using parameters β. $\beta_{\mathrm{alt}}$ is equal to β at all indices except for the jth index, which is perturbed by a factor f (i.e., $\beta_{\mathrm{alt},j} = f\,\beta_j$). (Note that f should be a number close to but not equal to 1. In this work, we assign f = 1.001.) W is a square matrix with n × $n_t$ rows and columns. W weights the elements of J by the reciprocal of the total number of cells at each time point. This weighting is included to balance the influence of later time points against that of the earlier time points (which often have far fewer nonzero voxel measurements than the later time points). For $n_t$ = 2, $W_{i,i}$ is calculated using Eq. 24:

\[
W_{i,i} =
\begin{cases}
\left(\displaystyle\sum_{j=1}^{n} N_{\mathrm{meas}}(j, t = 1)\right)^{-1}, & i \le n \\[2ex]
\left(\displaystyle\sum_{j=1}^{n} N_{\mathrm{meas}}(j, t = 2)\right)^{-1}, & i > n \text{ and } i \le 2n
\end{cases}
\tag{24}
\]
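The perturbation scheme of Eq. 23 and the weighting of Eq. 24 can be sketched in a few lines of NumPy. The function names `build_jacobian` and `build_weights`, the callable `model`, and the array layout are assumptions made for this illustration; the perturbation factor f = 1.001 is the value quoted above.

```python
import numpy as np

def build_jacobian(model, beta, f=1.001):
    """Numerical Jacobian (Eq. 23); `model(beta)` returns the vector {N_model}."""
    beta = np.asarray(beta, dtype=float)
    base = model(beta)
    J = np.empty((base.size, beta.size))
    for j in range(beta.size):
        beta_alt = beta.copy()
        beta_alt[j] = f * beta[j]      # perturb only the jth parameter
        # One full model evaluation per parameter (see Note 6 on the cost).
        J[:, j] = (model(beta_alt) - base) / (beta_alt[j] - beta[j])
    return J

def build_weights(N_meas):
    """Diagonal of W (Eq. 24); N_meas has shape (n_t, n), one row per time point."""
    totals = N_meas.sum(axis=1)        # total measured cells at each time point
    return np.repeat(1.0 / totals, N_meas.shape[1])
```

Note that a multiplicative perturbation leaves a parameter unchanged when its current value is exactly zero, so this sketch assumes all entries of `beta` are nonzero.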
Figure 4 summarizes the parameter optimization approach used to estimate model parameters k(x, y, z) and $D_0$. The model is initially evaluated with a guess of the model parameters (step 1 in Fig. 4). A guess of β is used to evaluate the objective function described in Eq. 25 (step 2 in Fig. 4):

\[
\mathrm{Error} = \sum_{t = t_1}^{t_n}
\left[\left(\sum_{i=1}^{n} N_{\mathrm{meas}}(i, t)\right)^{-1}
\sum_{i=1}^{n}\left(N_{\mathrm{model}}(i, t, \beta) - N_{\mathrm{meas}}(i, t)\right)^2\right].
\tag{25}
\]

Fig. 4 Iterative parameter optimization approach. A schematic is shown above for the iterative parameter optimization algorithm using the Levenberg-Marquardt method [32, 33]. The model is first evaluated with an initial guess of model parameters, line 1. The objective function is then evaluated with the current set of model parameters, line 2. The optimal model parameters are then determined in an iterative "while-loop" which ceases when stopping criteria are met. At the beginning of each iteration, the Jacobian is built, line 3, and is used to solve for the new guess of model parameters, line 4. The model is then re-evaluated with the new model parameters, line 5, and the objective function is calculated, line 6. Finally, the error is compared to the previously observed lowest error to determine if the new parameter values are acceptable. The optimization ceases when the stopping criteria are met
The initial evaluation of Eq. 25 sets the current minimum error, Error(β). J, W, and $D_{J^T W J}$ are then built (step 3 in Fig. 4). The parameter update vector {Δβ} is then calculated using Eq. 22 and added to {β} to give the current guess of model parameters, {$\beta_{\mathrm{test}}$} (step 4 in Fig. 4). The forward evaluation of the model is performed using model parameters {$\beta_{\mathrm{test}}$} (step 5 in Fig. 4). Equation 25 is then recalculated using {$\beta_{\mathrm{test}}$} (step 6 in Fig. 4). The error evaluated using {$\beta_{\mathrm{test}}$}, Error($\beta_{\mathrm{test}}$), is compared to Error(β). If Error($\beta_{\mathrm{test}}$) is less than Error(β), {$\beta_{\mathrm{test}}$} is accepted (i.e., {β} = {$\beta_{\mathrm{test}}$}) and α is decreased by a factor of 12. If Error($\beta_{\mathrm{test}}$) is greater than Error(β), {$\beta_{\mathrm{test}}$} is rejected and α is increased by a factor of 3. (Note that the factors by which α is increased or decreased (3 and 12 in this work) are often problem-specific and need to be determined empirically to improve convergence.) At this point, the stopping criteria are also evaluated. A stopping criterion can be a maximum number of iterations, a minimal threshold of error, a minimal relative change in model error between successful iterations, or a minimal relative change in model parameters [34] between successful iterations. In general, the error will never reach zero for this type of system, so selecting a stopping criterion that is sensitive to the relative change in error or parameter values will indicate convergence. The parameter optimization process continues by returning to step 3 until the stopping criteria are met.

At the conclusion of the parameter optimization process, the optimized model parameters are used in a final forward evaluation of the model from $t_i$ to $t_f$. The error between $N_{\mathrm{model}}(x, y, z, t)$ and $N_{\mathrm{meas}}(x, y, z, t)$ is assessed at the time points not used in the parameter optimization, $t_{n+1}$ to $t_f$.
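The iterative scheme of Fig. 4 can be sketched as a short NumPy loop. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the function name `levenberg_marquardt`, the callable `model`, the initial damping value, the iteration cap, and the tolerance are all hypothetical, and W is passed as the vector `w` of its diagonal entries. The accept and reject factors (12 and 3) follow the text.

```python
import numpy as np

def levenberg_marquardt(model, beta0, N_meas, w, alpha=1.0,
                        max_iter=50, tol=1e-8, f=1.001):
    """Weighted Levenberg-Marquardt loop following Eq. 22 and Fig. 4 (sketch)."""
    beta = np.asarray(beta0, dtype=float)
    err = np.sum(w * (N_meas - model(beta)) ** 2)   # steps 1-2: weighted error
    for _ in range(max_iter):
        # Step 3: build J by perturbing one parameter at a time (Eq. 23).
        base = model(beta)
        J = np.empty((base.size, beta.size))
        for j in range(beta.size):
            b = beta.copy()
            b[j] *= f
            J[:, j] = (model(b) - base) / (b[j] - beta[j])
        # Step 4: solve Eq. 22 for the parameter update.
        JTWJ = J.T @ (w[:, None] * J)
        lhs = JTWJ + alpha * np.diag(np.diag(JTWJ))
        rhs = J.T @ (w * (N_meas - base))
        beta_test = beta + np.linalg.solve(lhs, rhs)
        # Steps 5-6: re-evaluate the model and the error.
        err_test = np.sum(w * (N_meas - model(beta_test)) ** 2)
        if err_test < err:
            rel_change = (err - err_test) / err
            beta, err = beta_test, err_test   # accept the new parameters
            alpha /= 12.0
            if rel_change < tol:              # stop on small relative change
                break
        else:
            alpha *= 3.0                      # reject and increase damping
    return beta, err
```

Here the stopping rule combines a maximum number of iterations with a minimal relative change in error between successful iterations; other criteria listed above could be substituted without changing the structure of the loop.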
3.4 Summary and Outlook

In this chapter, a modeling and experimental framework was described which can be used to individualize a predictive biophysical model from an individual patient's imaging data. Clinically available imaging measurements from CE-MRI and DW-MRI were used to provide serial estimates of tumor cell number that were then used in an inverse problem to optimize model parameters for the measured tumor. These individually optimized model parameters could then be used to predict future growth or response. For example, acquiring data early in the course of a patient's therapy could be used to calibrate a patient-specific model that could then be used to predict the efficacy of the current treatment weeks or months before response is identifiable through standard criteria (e.g., the Response Evaluation Criteria in Solid Tumors [34]). For predicted non-responders, the calibrated model could potentially be used to evaluate other treatment regimens to adapt clinical care to improve the outcome on an individual patient basis. While this is a promising avenue for the future of clinical cancer care, further development of predictive biophysical models is needed to characterize patient response to a variety of available patient treatments [35].
4 Notes
1. When collecting a new dataset or evaluating this model in a different disease setting, model parameters should be measured or estimated on an individual basis. When this is not possible, model parameters should be assigned (or calculated) from literature values (e.g., G, ν, θ) obtained from experiments that most closely match the tumor or tumor location currently under investigation. Model parameters that cannot be measured experimentally or assigned from the literature (e.g., $\lambda_D$, $\lambda_f$) can be assigned empirically based on results observed in a cohort. Sensitivity analysis (e.g., [36]) of the model system can also be used to help determine which model parameters require assignment on an individual basis and which model parameters may be assigned for the cohort.
2. The experimental time line may change depending on the
particular cancer under investigation, its growth rate, and the
initial size of the tumor. We selected day 10 to start our
imaging experiments, as the tumors are approximately
20–40 mm3 and typically extend over multiple imaging slices.
3. To calculate the physical carrying capacity (i.e., the maximum number of cells a space can contain), assumptions need to be made about the overall tissue structure and cellular shape, which can be verified through histological observations of the tissue. For the C6 line, we assumed that the tumor tissue was predominantly composed of spherical tumor cells with a packing density and an average cell volume obtained from the literature [26, 27]. When comparing the DW-MRI estimate of cellularity with the model-predicted cellularity, the precise values for packing density and average cell volume are not critical as long as the same carrying capacity is used in both the model and the ADC-to-cellularity calculation. However, when comparing to histological data, more care is required to match the average size, shape, and packing density of the tumor cells to what is observed in vivo. Packing density can be calculated from hematoxylin and eosin (H&E) stained tissue sections as the fraction of the H&E stained area over the total tumor ROI. The average cell area can then be calculated as the total occupied area (packing density multiplied by total ROI area) divided by the number of hematoxylin-positive cells. The average cell area can then be used to calculate an average cell radius and volume. In H&E stained sections obtained in one rat we calculated an average packing density of 0.764 ± 0.054% (mean ± 95% confidence interval) and an average volume of 982 ± 247 μm³.
4. The coefficient matrix [M] is a sparse and potentially very large (3n × 3n) matrix. To conserve memory and reduce computation time, [M] can be represented by a sparse matrix $M_{\mathrm{compact}}$, which is an nz × 3 matrix, where nz is the number of nonzero elements of [M], and the three columns hold the value of each nonzero entry, its matrix row, and its matrix column, respectively. While many sparse matrix data formats exist, in this realization we used the format native to MATLAB. Sparse systems are typically solved with some form of iterative method paired with a matrix preconditioner to increase the speed of calculation. In this realization, we employed one of the available MATLAB methods, namely, the bi-conjugate gradient stabilized method with an incomplete LU factorization as a preconditioner.
5. The simulation time step, $h_t$, is selected to maintain numerical stability for a range of diffusion coefficients during the parameter optimization process. To be stable, the product $D\,h_t\left(1/h_x^2 + 1/h_y^2 + 1/h_z^2\right)$ must be less than 1/2, or for isotropic dimensions the product $D\,h_t/h^2$ must be less than 1/6. To be monotonic and stable, the product $D\,h_t\left(1/h_x^2 + 1/h_y^2 + 1/h_z^2\right)$ must be less than 1/4, or for isotropic dimensions the product $D\,h_t/h^2$ must be less than 1/12.
6. Building or updating the Jacobian matrix, J, can become time intensive as the number of model parameters increases, because Eq. 23 (and thus a full model evaluation) needs to be evaluated for each model parameter perturbation. Parallelized code can be used to build several columns of J simultaneously, dramatically decreasing the computation time. For example, non-parallelized code takes approximately 13.1 min per 100 parameters, while parallelized code divided among 32 threads takes 0.7 min per 100 parameters. Alternatively, approaches such as Broyden's method [37] can be used to update J at each iteration while only building the full J matrix in the first iteration. Briefly, Broyden's method is a secant method update that estimates J at the nth iteration based on the previous J, the difference between the model evaluations at the (n − 1) and (n − 2) iterations, and the difference between the model parameters at the (n − 1) and (n − 2) iterations.
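As an illustration of the secant update described in Note 6, the following sketch implements the standard rank-one form of Broyden's update. The function name `broyden_update` and its argument names are hypothetical; how the update is interleaved with full rebuilds of J is left open here.

```python
import numpy as np

def broyden_update(J, d_beta, d_model):
    """Rank-one secant update of J (Broyden's method, sketch).

    d_beta  : change in model parameters between the two most recent iterations
    d_model : corresponding change in the model evaluation {N_model}
    """
    d_beta = np.asarray(d_beta, dtype=float)
    d_model = np.asarray(d_model, dtype=float)
    # Correct J so that the updated matrix maps d_beta exactly onto d_model.
    return J + np.outer(d_model - J @ d_beta, d_beta) / (d_beta @ d_beta)
```

The updated matrix satisfies the secant condition (the new J times `d_beta` equals `d_model`) at the cost of a single outer product, in contrast to one full model evaluation per parameter for Eq. 23.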
Acknowledgments
This work was supported through funding from CPRIT RR160005

and the National Cancer Institute U01CA174706, K25CA204599,
and R01CA186193, from the National Institute of Neurological
Disorders and Stroke R01NS049251 and the Vanderbilt-Ingram
Cancer Center Support Grant (NIH P30CA68485).
References
1. Yankeelov TE, Quaranta V, Evans KJ, Rericha EC (2015) Toward a science of tumor forecasting for clinical oncology. Cancer Res 75(6):918–923
2. Atuegwu NC, Gore JC, Yankeelov TE (2010) The integration of quantitative multi-modality imaging data into mathematical models of tumors. Phys Med Biol 55(9):2429–2449
3. Atuegwu NC, Colvin DC, Loveless ME, Xu L, Gore JC, Yankeelov TE (2012) Incorporation of diffusion-weighted magnetic resonance imaging data into a simple mathematical model of tumor growth. Phys Med Biol 57(1):225–240
4. Weis JA, Miga MI, Arlinghaus LR, Li X, Chakravarthy AB, Abramson V et al (2013) A mechanically coupled reaction-diffusion model for predicting the response of breast tumors to neoadjuvant chemotherapy. Phys Med Biol 58(17):5851–5866
5. Hormuth DA II, Weis JA, Barnes SL, Miga MI, Rericha EC, Quaranta V et al (2015) Predicting in vivo glioma growth with the reaction diffusion equation constrained by quantitative magnetic resonance imaging data. Phys Biol 12(4):46006
6. Weis JA, Miga MI, Arlinghaus LR, Li X, Abramson V, Chakravarthy AB et al (2015) Predicting the response of breast cancer to neoadjuvant therapy using a mechanically coupled reaction-diffusion model. Cancer Res 75(22):4697–4707
7. Baldock A, Rockne R, Boone A, Neal M, Bridge C, Guyman L et al (2013) From patient-specific mathematical neuro-oncology to precision medicine. Front Oncol 3:62
8. Corwin D, Holdsworth C, Rockne RC, Trister AD, Mrugala MM, Rockhill JK et al (2013) Toward patient-specific, biologically optimized radiation therapy plans for the treatment of glioblastoma. PLoS One 8(11):e79115
9. Hogea C, Davatzikos C, Biros G (2008) An image-driven parameter estimation problem for a reaction-diffusion glioma growth model with mass effects. J Math Biol 56(6):793–825
10. Liu Y, Sadowski SM, Weisbrod AB, Kebebew E, Summers RM, Yao J (2014) Patient specific tumor growth prediction using multimodal images. Med Image Anal 18(3):555–566
11. Konukoglu E, Clatz O, Menze BH, Stieltjes B, Weber M-A, Mandonnet E et al (2010) Image guided personalization of reaction-diffusion type tumor growth models using modified anisotropic eikonal equations. IEEE Trans Med Imaging 29:77–95
12. Garg I, Miga MI (2008) Preliminary investigation of the inhibitory effects of mechanical stress in tumor growth. Proc SPIE 29:69182L-11
13. Venes D (2013) Taber's® cyclopedic medical dictionary, 22nd edn. F. A. Davis Company, Philadelphia, PA
14. DeAngelis LM (2001) Brain tumors. N Engl J Med 344(2):114–123
15. Helmlinger G, Netti PA, Lichtenbeld HC, Melder RJ, Jain RK (1997) Solid stress inhibits the growth of multicellular tumor spheroids. Nat Biotechnol 15(8):778–783
16. Padhani AR, Liu G, Mu-Koh D, Chenevert TL, Thoeny HC, Takahara T et al (2009) Diffusion-weighted magnetic resonance imaging as a cancer biomarker: consensus and recommendations. Neoplasia 11(2):102–125
17. Yankeelov TE, Gore JC (2009) Dynamic contrast enhanced magnetic resonance imaging in oncology: theory, data acquisition, analysis, and examples. Curr Med Imaging Rev 3(2):91–107
18. Barth R, Kaur B (2009) Rat brain tumor models in experimental neuro-oncology: the C6, 9L, T9, RG2, F98, BT4C, RT-2 and CNS-1 gliomas. J Neuro-Oncol 94(3):299–312
19. Hormuth DA II, Weis JA, Barnes SL, Miga MI, Rericha EC, Quaranta V, Yankeelov TE (2017) A mechanically-coupled reaction-diffusion model that incorporates intra-tumoral heterogeneity to predict in vivo glioma growth. J R Soc Interface 14:128
20. Barnes SL, Sorace AG, Loveless ME, Whisenant JG, Yankeelov TE (2015) Correlation of tumor characteristics derived from DCE-MRI and DW-MRI with histology in murine models of breast cancer. NMR Biomed 28(10):1345–1356
21. Anderson AW, Xie J, Pizzonia J, Bronen RA, Spencer DD, Gore JC (2000) Effects of cell volume fraction changes on apparent diffusion in human cells. Magn Reson Imaging 18(6):689–695
22. Guo Y, Cai Y-Q, Cai Z-L, Gao Y-G, An N-Y, Ma L et al (2002) Differentiation of clinically benign and malignant breast lesions using diffusion-weighted imaging. J Magn Reson Imaging 16(2):172–178
23. Sugahara T, Korogi Y, Kochi M, Ikushima I, Shigematu Y, Hirai T et al (1999) Usefulness of diffusion-weighted MRI with echo-planar technique in the evaluation of cellularity in gliomas. J Magn Reson Imaging 9(1):53–60
24. Humphries PD, Sebire NJ, Siegel MJ, Olsen ØE (2007) Tumors in pediatric patients at diffusion-weighted MR imaging: apparent diffusion coefficient and tumor cellularity. Radiology 245(3):848–854
25. Whisenant JG, Ayers GD, Loveless ME, Barnes SL, Colvin DC, Yankeelov TE (2014) Assessing reproducibility of diffusion-weighted magnetic resonance imaging studies in a murine model of HER2+ breast cancer. Magn Reson Imaging 32(3):245–249
26. Martin I, Dozin B, Quarto R, Cancedda R, Beltrame F (1997) Computer-based technique for cell aggregation analysis and cell aggregation in in vitro chondrogenesis. Cytometry 28(2):141–146
27. Rouzaire-Dubois B, Milandri JB, Bostel S, Dubois JM (2000) Control of cell proliferation by cell volume alterations in rat C6 glioma cells. Pflugers Arch 440(6):881–888
28. Elkin BS, Ilankovan AI, Morrison B III (2011) A detailed viscoelastic characterization of the P17 and adult rat brain. J Neurotrauma 28:2235
29. Lee SJ, King MA, Sun J, Xie HK, Subhash G, Sarntinoranont M (2014) Measurement of viscoelastic properties in multiple anatomical regions of acute rat brain tissue slices. J Mech Behav Biomed Mater 29:213–224
30. Lynch D (2005) Numerical partial differential equations for environmental scientists and engineers: a first practical course. Springer, New York, NY
31. Miga MI, Paulsen KD, Lemery JM, Eisner SD, Hartov A, Kennedy FE et al (1999) Model-updated image guidance: initial clinical experiences with gravity-induced brain deformation. IEEE Trans Med Imaging 10:866–874
32. Levenberg K (1944) A method for the solution of certain non-linear problems in least squares. Q J Appl Mathematics II(2):164–168
33. Marquardt DW (1963) An algorithm for least-squares estimation of nonlinear parameters. J Soc Ind Appl Math 11(2):431–441
34. Eisenhauer EA, Therasse P, Bogaerts J, Schwartz LH, Sargent D, Ford R et al (2009) New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur J Cancer 45(2):228–247
35. Yankeelov TE, Atuegwu N, Hormuth DA, Weis JA, Barnes SL, Miga MI et al (2013) Clinically relevant modeling of tumor growth and treatment response. Sci Transl Med 5(187):187ps9
36. Marino S, Hogue IB, Ray CJ, Kirschner DE (2008) A methodology for performing global uncertainty and sensitivity analysis in systems biology. J Theor Biol 254(1):178–196
37. Broyden CG (1965) A class of methods for solving nonlinear simultaneous equations. Math Comput 19(92):577–593
Chapter 12
Profiling Tumor Infiltrating Immune Cells with CIBERSORT

Binbin Chen, Michael S. Khodadoust, Chih Long Liu, Aaron M. Newman,
and Ash A. Alizadeh
Abstract
Tumor infiltrating leukocytes (TILs) are an integral component of the tumor microenvironment and have
been found to correlate with prognosis and response to therapy. Methods to enumerate immune subsets
such as immunohistochemistry or flow cytometry suffer from limitations in phenotypic markers and can be
challenging to practically implement and standardize. An alternative approach is to acquire aggregative high
dimensional data from cellular mixtures and to subsequently infer the cellular components computationally.
We recently described CIBERSORT, a versatile computational method for quantifying cell fractions from
bulk tissue gene expression profiles (GEPs). Combining support vector regression with prior knowledge of
expression profiles from purified leukocyte subsets, CIBERSORT can accurately estimate the immune
composition of a tumor biopsy. In this chapter, we provide a primer on the CIBERSORT method and
illustrate its use for characterizing TILs in tumor samples profiled by microarray or RNA-Seq.
Key words Cancer immunology, Deconvolution, Support vector regression (SVR), Tumor infiltrating leukocytes (TILs), Tumor microenvironment, Tumor heterogeneity, Gene expression, Microarray, RNA-Seq, TCGA
1 Introduction

Neoplastic cells reside within a complex tumor microenvironment necessary for tumor growth and survival. Numerous non-neoplastic cell types including tumor-infiltrating leukocytes (TILs) comprise the tumor stroma. This immune infiltrate is often a heterogeneous mixture of immune cells that includes both innate and adaptive immune populations, and cell types associated with active (e.g., cytotoxic T lymphocytes) and suppressive (e.g., regulatory T cells, myeloid-derived suppressor cells) immune functions. The significance of TILs varies by cancer histology, with the presence of certain immune subsets often exhibiting a beneficial prognostic effect in one malignancy but a detrimental effect in another cancer type [1, 2]. The importance of TIL assessment continues to grow with the development of novel immunotherapeutic agents designed to target these cells. Recent studies have found that T lymphocyte subsets (e.g., CD8+) may predict response to existing and emerging immunotherapies, highlighting the importance of investigating tumor-associated immune cells as potential predictive biomarkers [3–5].
The tumor immune infiltrate has traditionally been evaluated by histology on tissue sections, and immune subsets have been inferred by immunohistochemistry of individual markers. However, immunophenotyping typically requires multiple parameters to accurately subset populations, and thus immunohistochemistry is unable to identify many immune populations and performs poorly at capturing functional phenotypes (e.g., activated vs. resting lymphocytes) [6]. Flow cytometry is an alternative method of quantifying immune infiltrates that enables simultaneous measurement of multiple parameters. However, this method requires prompt and careful processing of samples as well as tissue disaggregation, which may result in the loss of fragile cell types and the distortion of gene expression profiles. While flow cytometry can assess multiple markers, this number is still limited, potentially excluding markers that may better discriminate closely related cell populations [7].
In contrast, gene expression profiling of bulk tissues does not depend on surface markers and does not suffer from artifacts related to cellular dissociation. Samples can be readily processed and stored in a standardized fashion, mitigating issues that may confound data collected at different times and from different locations. Although previous studies of bulk tumor samples revealed a number of immune-enriched gene signatures with prognostic significance [8, 9], linking these signatures to specific TIL phenotypes has been challenging [10–14]. Methods for mathematically separating a bulk tumor gene expression profile (GEP) into its component cell types can overcome this issue.

Several computational tools, including linear least-square regression (LLSR) [7], microarray microdissection with analysis of differences (MMAD) [15], and digital sorting algorithm (DSA) [16], have been applied to the deconvolution of complex GEP mixtures to infer cellular composition. Although these approaches are effective for enumerating highly distinct cell types in mixtures with minimal unknown content (e.g., lymphocytes, monocytes, and neutrophils in whole blood), they are sensitive to experimental noise, high unknown mixture content, and closely related cell types, limiting their utility for TIL assessment [17, 18].
CIBERSORT, a computational approach developed by our group, aims to address these challenges (see Fig. 1) [17]. Like other methods, CIBERSORT requires a specialized knowledgebase of gene expression signatures, termed a "signature matrix," for the deconvolution of cell types of interest. However, in contrast to previous efforts, CIBERSORT implements a machine learning approach, called support vector regression (SVR), that improves deconvolution performance through a combination of feature selection and robust mathematical optimization techniques (see Subheading 1.1 for details). In benchmarking experiments, CIBERSORT was more accurate than other methods in resolving closely related cell subsets and in mixtures with unknown cell types (e.g., solid tissues) [17]. Thus, CIBERSORT is a useful approach for high-throughput characterization of diverse cell types, such as TILs, from complex tissues. Here, we provide users with a practical roadmap for dissecting leukocyte content in tumor gene expression datasets with CIBERSORT.

[Figure 1 panel labels: bulk tissue/tumor or blood draw; purify; RNA profile; signature matrix; CIBERSORT; cell proportions; significance analysis]

Fig. 1 Overview of CIBERSORT. As input, CIBERSORT requires a "signature matrix" comprised of barcode genes that are enriched in each cell type of interest. Once a suitable knowledgebase is created and validated, CIBERSORT can be applied to characterize cell type proportions in bulk tissue expression profiles. Although originally validated using a signature matrix containing 22 functionally defined human immune subsets (LM22) profiled by microarrays, CIBERSORT is a general framework that can be applied to diverse cell phenotypes and genomic data types, including RNA-Seq. To quantitatively capture deconvolution confidence, CIBERSORT calculates several quality control metrics, including a deconvolution p-value
1.1 CIBERSORT Model
A common objective of gene expression deconvolution algorithms
is to solve the following system of linear equations for f:
m = f × B
m: a vector consisting of a mixture GEP (input requirement).
f: a vector consisting of the fraction of each cell type in the
signature matrix (unknown).
B: a “signature matrix” containing signature genes for cell
subsets of interest (input requirement).
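To make the notation concrete, the following minimal Python sketch evaluates the forward model m = f × B for a toy signature matrix. The gene values, cell type names, and the function name mix_expression are illustrative assumptions and are not part of CIBERSORT; deconvolution addresses the inverse problem of recovering f when only m and B are known.

```python
def mix_expression(fractions, signature):
    """Return the mixture profile m = f x B as a dict of gene -> expression.

    fractions: dict mapping cell type -> fraction (the vector f).
    signature: dict mapping gene -> {cell type -> expression} (the matrix B).
    """
    return {
        gene: sum(fractions[cell] * by_cell[cell] for cell in fractions)
        for gene, by_cell in signature.items()
    }


# Toy signature matrix: two hypothetical cell types, two marker genes.
toy_signature = {
    "CD8A": {"T cells": 100.0, "B cells": 0.0},
    "MS4A1": {"T cells": 0.0, "B cells": 200.0},
}
toy_fractions = {"T cells": 0.25, "B cells": 0.75}
toy_mixture = mix_expression(toy_fractions, toy_signature)
```

Each mixture value is simply the fraction-weighted sum of the cell type expression values for that gene, which is why highly correlated columns of B make f hard to recover.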
CIBERSORT differs from previous deconvolution methods in
its application of a machine learning technique, ν-support vector
regression (ν-SVR), to solve for f [19]. Briefly, SVR defines a
hyperplane that captures as many data points as possible, given
defined constraints, and reduces overfitting by only penalizing data
points outside a certain error radius (termed support vectors) using
a linear “epsilon-insensitive” loss function. The orientation of the
hyperplane determines f. In the original description of CIBERSORT,
the support vectors were genes selected from a signature
matrix; however, the CIBERSORT algorithm is completely
generalizable and can be applied to diverse genomic features [20]. The
parameter ν sets a lower bound on the fraction of support vectors
and an upper bound on the fraction of training errors. CIBERSORT
uses a set of ν values (0.25, 0.5, 0.75) and chooses the value
producing the best performance (i.e., the lowest root mean squared
error between m and the deconvolution result, f × B). In addition,
ν-SVR incorporates L2-norm regularization, which minimizes the
variance in the weights assigned to highly correlated cell types,
thereby mitigating issues owing to multicollinearity.
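The two ingredients named above, the epsilon-insensitive loss and the selection of ν by root mean squared error, can be sketched in a few lines of Python. This is not the CIBERSORT implementation (which relies on an SVR solver); it assumes the candidate reconstructions f × B have already been computed for each ν, and all function names are hypothetical.

```python
import math


def epsilon_insensitive_loss(residuals, epsilon):
    """Linear loss that ignores residuals whose magnitude is within epsilon."""
    return sum(max(0.0, abs(r) - epsilon) for r in residuals)


def rmse(observed, predicted):
    """Root mean squared error between two equal-length profiles."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pick_best_nu(mixture, reconstructions):
    """Return the nu whose reconstruction (f x B) is closest to the mixture.

    reconstructions: dict mapping each candidate nu to its reconstructed
    profile, assumed to come from a separate SVR fit per nu.
    """
    return min(reconstructions, key=lambda nu: rmse(mixture, reconstructions[nu]))


# Toy example: the fit obtained with nu = 0.5 reproduces the mixture best.
toy_fits = {0.25: [12.0, 18.0], 0.5: [10.0, 21.0], 0.75: [0.0, 0.0]}
best_nu = pick_best_nu([10.0, 20.0], toy_fits)
```

Note how a residual of 0.5 contributes nothing to the loss when epsilon is 1.0; only points outside the error radius influence the fit.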
CIBERSORT also allows users to create a custom signature
matrix. Differentially expressed genes between cell types of interest
are identified by a two-sided unequal variance t-test corrected for
multiple hypothesis testing. A feature selection step is then
performed to minimize the condition number, a matrix property that
captures how well the linear system tolerates input variation and
noise. For signature matrices composed exclusively of immune cell
types, there is an option to filter non-hematopoietic and
cancer-specific genes to reduce the influence of non-immune cells on
deconvolution results. By choosing features that minimize the
condition number, CIBERSORT improves the stability of the
signature matrix and further reduces the impact of multicollinearity.
Additional details of the CIBERSORT method can be found in the
original publication [17].
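As an illustration of why the condition number matters, the sketch below computes the 2-norm condition number of a signature matrix with only two cell types, using the closed-form eigenvalues of the 2 × 2 matrix BᵀB. The restriction to two columns and the function name are assumptions made to keep the example dependency-free; a real signature matrix would be handled with a linear algebra library.

```python
import math


def condition_number_two_columns(col_a, col_b):
    """2-norm condition number of the genes x 2 matrix [col_a, col_b].

    The singular values of B are the square roots of the eigenvalues of
    the 2 x 2 Gram matrix B^T B, which have a closed form.
    """
    a = sum(x * x for x in col_a)
    c = sum(y * y for y in col_b)
    b = sum(x * y for x, y in zip(col_a, col_b))
    root = math.sqrt((a - c) ** 2 + 4.0 * b * b)
    lam_max = (a + c + root) / 2.0
    lam_min = (a + c - root) / 2.0
    if lam_min <= 0.0:
        return math.inf  # columns are linearly dependent
    return math.sqrt(lam_max / lam_min)


# Two genes that separate the cell types well versus two genes that do not.
well_separated = condition_number_two_columns([10, 1], [1, 10])
nearly_collinear = condition_number_two_columns([10, 9], [9, 10])
```

The nearly collinear pair yields a much larger condition number, meaning small errors in the mixture profile translate into large errors in the estimated fractions.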
2 Materials
The general workflow for analyzing RNA admixtures with
CIBERSORT requires two key input files (see Fig. 1):
1. The “mixture file” is a single tab-delimited text file containing
one or more GEPs of biological mixture samples (see Table 1). The
first column contains gene names and should have “Name”
(or similar) as a column header (i.e., in the space occupying
column 1, row 1). Multiple samples may be analyzed in parallel,
with the remaining columns (2, 3, etc.) dedicated to mixture
GEPs, where each row holds the expression values for a
given gene and the column header is the name of the mixture
sample. Note that the mixture file and the signature matrix must
share the same naming scheme for gene identifiers.
2. The “signature matrix” is a tab-delimited text file consisting of
sets of “barcode genes” whose expression values collectively
define unique gene expression signatures for each cell subset of
interest. The file format is similar to the mixture file, with gene
names in column 1. The remaining columns consist of signature
GEPs from individual cell subsets. A validated leukocyte gene
signature matrix (LM22) is available for the deconvolution of
22 functionally defined human hematopoietic subsets. LM22
was generated using Affymetrix HGU133A microarray data
[17] and has been rigorously tested on Affymetrix HGU133
and Illumina Beadchip platforms. For the application of LM22
to RNA-Seq data, see Note 1.
Table 1
Format of input mixture files (tab-separated plain text)
Gene_symbol (required)    Mixture 1    Mixture 2    ...
Gene1
Gene2
...
Importantly, all expression data should be non-negative,
devoid of missing values, and represented in non-log linear space.
For Affymetrix microarrays, a custom chip definition file (CDF) is
recommended (see Subheading 3.2.2), and the data should be
normalized with MAS5 or RMA. Illumina Beadchip and single color
Agilent arrays should be processed as described in the limma package
documentation. Standard RNA-Seq expression quantification metrics,
such as fragments per kilobase of transcript per million mapped
reads (FPKM) and transcripts per million (TPM), are suitable for
use with CIBERSORT.
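The input requirements above (tab-delimited layout, no missing or negative values, linear rather than log space) can be checked before upload with a short script. The following Python sketch is one possible pre-flight check, not a CIBERSORT utility; the function name and the cutoff of 50 used to flag possibly log-transformed data are assumptions chosen for illustration.

```python
import csv
import io
import math


def check_mixture_table(handle, log_space_cutoff=50.0):
    """Return a list of human-readable problems found in a mixture table."""
    rows = list(csv.reader(handle, delimiter="\t"))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    if len(header) < 2:
        problems.append("need a gene column and at least one mixture column")
    seen_genes = set()
    values = []
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, found {len(row)}")
            continue
        if row[0] in seen_genes:
            problems.append(f"line {line_no}: duplicate gene identifier {row[0]}")
        seen_genes.add(row[0])
        for cell in row[1:]:
            try:
                value = float(cell)
            except ValueError:
                problems.append(f"line {line_no}: missing or non-numeric value {cell!r}")
                continue
            if math.isnan(value):
                problems.append(f"line {line_no}: missing value")
            elif value < 0:
                problems.append(f"line {line_no}: negative value {cell}")
            else:
                values.append(value)
    # Assumed rule of thumb: linear-scale data usually contain large values.
    if values and max(values) < log_space_cutoff:
        problems.append("all values are small: data may be log-transformed")
    return problems


# Usage with an in-memory table; a real file would be opened with open(path, newline="").
example = io.StringIO("Name\tMix1\tMix2\nCD8A\t120.5\t300\nMS4A1\t15\t2200\n")
example_problems = check_mixture_table(example)
```

An empty list means that none of the checks fired; it does not guarantee that the normalization itself was appropriate.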
In the sections below, we illustrate how CIBERSORT can be
used to analyze complex tissues, whether profiled by microarray
(Subheading 3.2) or RNA-Seq (Subheadings 3.3 and 3.4). We also
provide instructions for custom signature matrix creation
(Subheading 3.3). Although this protocol focuses on the deconvolution of
gene expression data, CIBERSORT can be applied to other genomic
data types, such as ATAC-Seq, provided that data from purified
components are available on the same platform. Public genomic
data repositories include the NIH Gene Expression Omnibus
database (GEO, http://www.ncbi.nlm.nih.gov/geo/) and the NIH
Genomic Data Commons (https://gdc.cancer.gov/). See
Subheading 3.3 and Note 1 for more details.
All files necessary for this protocol, including R and Java
implementations of CIBERSORT, can be downloaded through the links
provided herein or from the CIBERSORT website
(http://cibersort.stanford.edu). The following packages are required for
the standalone R version: “e1071,” “parallel,” and
“preprocessCore.” The Java version additionally requires the following R
packages: “Rserve” and “colorRamps.” The “affy,” “annotate,”
and “org.Hs.eg.db” R packages are required only if using the R
script from the CIBERSORT website to process Affymetrix CEL
files, as described in Subheading 3.2.2.
3 Methods
3.1 Installation
CIBERSORT can be run online (http://cibersort.stanford.edu/)
or downloaded for local use, and is freely available for academic
non-profit research. While the current R script can be used to run
the CIBERSORT deconvolution engine, users wishing to create a
custom signature matrix will need to use the website or the Java
executable. To download and install the R dependencies described
in Materials, run the following commands from an R terminal:
Within R:
> install.packages('e1071') # R and Java versions
> # On Bioconductor 3.8 or later, BiocManager::install() replaces biocLite()
> source('http://bioconductor.org/biocLite.R')
> biocLite('parallel') # R and Java versions (bundled with R 2.14.0 and later)
> biocLite('preprocessCore') # R and Java versions
> biocLite('Rserve') # Java version only
> biocLite('colorRamps') # Java version only
> biocLite('affy') # used to normalize Affymetrix CEL files (Subheading 3.2.2)
> biocLite('annotate') # used to annotate Affymetrix CEL files (Subheading 3.2.2)
> biocLite('org.Hs.eg.db') # used to annotate human Affymetrix CEL files (Subheading 3.2.2)
3.2 Enumerating TIL Subsets with LM22
LM22 is a signature matrix file consisting of 547 genes that
accurately distinguish 22 mature human hematopoietic populations
isolated from peripheral blood or in vitro culture conditions,
including seven T cell types, naive and memory B cells, plasma
cells, NK cells, and myeloid subsets. LM22 was designed and
extensively validated using gene expression microarray data, but is
also applicable to RNA-Seq data for hypothesis generation (see
Note 1). Here, we illustrate how to prepare Affymetrix microarray
data for use with LM22, and how to run CIBERSORT with LM22
to characterize the leukocyte composition of prostate biopsies
obtained from patients with prostate cancer and from healthy
subjects. To follow the examples in this section, download GSE55945
CEL files from GEO
(https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE55945&format=file).
Processed data for GSE55945 can be downloaded from the
CIBERSORT website.
3.2.1 General Tips for Mixture File Preparation
Gene expression data must be preprocessed as specified in
Subheadings 2 and 3.2.2. Because LM22 uses HUGO gene symbols (e.g.,
CD8A, MS4A1, CTLA4, etc.), all mixture files need to possess
matching HUGO identifiers. See Note 2 for using non-HUGO
gene identifiers. Importantly, all expression values should be in
non-log (i.e., linear) space with non-negative numerical values and no
missing data. Not all signature matrix genes need to be present in
the mixture expression data, but performance will improve with the
presence of more signature genes.
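Because performance depends on how many signature genes are present in the mixture file, it can be useful to measure that overlap before running an analysis. The Python sketch below is a minimal, hypothetical helper (signature_coverage is not a CIBERSORT function) that reports the fraction of signature genes found and lists the missing ones.

```python
def signature_coverage(signature_genes, mixture_genes):
    """Return (fraction of signature genes present, sorted missing genes)."""
    signature = set(signature_genes)
    present = signature & set(mixture_genes)
    missing = sorted(signature - present)
    fraction = len(present) / len(signature) if signature else 0.0
    return fraction, missing


# Toy example with four signature genes, two of which are in the mixture.
coverage, missing_genes = signature_coverage(
    ["CD8A", "MS4A1", "CTLA4", "CD19"],
    ["CD8A", "CTLA4", "ACTB"],
)
```

Exact string matching is used, so identifiers that differ only in case or that use aliases will be reported as missing.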
3.2.2 Preparation of Affymetrix CEL Files
The CIBERSORT website provides an R script to convert
Affymetrix CEL files, the raw data format for Affymetrix microarray
experiments, into a tabular format that is ready for analysis with
CIBERSORT (Menu>Download). All packages specified in the
Installation section will need to be downloaded, along with a
custom CDF from BrainArray
(http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/20.0.0/entrezg.asp).
The custom CDF must be compatible with the microarray
platform used to profile the mixtures (e.g., for HGU133 Plus 2.0,
download hgu133plus2hsentrezgcdf_20.0.0.tar.gz); the latest
entrezg version is always recommended. Download the custom
CDF and run the following terminal command to install the R
library:
sudo R CMD INSTALL downloaded_customCDF_filename.tar.gz
The user is advised to run this step on a machine with root
access or a self-contained R environment like RGui. Next, navigate
to the directory containing raw Affymetrix CEL files (GSE55945 in
this example) and run CEL_to_mixture.R, an R script that should
be placed in the same folder as the CEL files. The script will output
a correctly formatted CIBERSORT mixture file named
NormalizedExpressionArray.customCDF.txt. For this example, rename it to
“prostate_cancer.txt.”
3.2.3 Running CIBERSORT
Before running CIBERSORT, all mixture files need to be uploaded
(Menu > “Upload Files”). The user needs to select “Mixture”
when uploading mixture files. After uploading the correctly
formatted mixture file (e.g., prostate_cancer.txt) to the website, go to
“Run CIBERSORT” under Menu (see Fig. 2). Select “LM22
(22 immune cell types)” for “Signature gene file.” When clicking
“Mixture file,” the uploaded mixture file will be one of the options.
Select “Run” after choosing both the mixture file of interest and a
permutation number. At least 100 permutations are recommended
to achieve statistical rigor.
Fig. 2 CIBERSORT web interface. All the files except the LM22 gene signature need to be uploaded to the
CIBERSORT website before proceeding to this page. When using LM22, the user will need to select the
uploaded mixture file and specify “LM22 (22 immune cell types)” for the signature gene file. When creating
custom gene signatures, a reference sample file and a phenotype classes file are required, and need to be
uploaded to the webserver. For CIBERSORT to generate a meaningful p-value, we recommend at least
100 permutations; however, this parameter can be set to a small number for exploratory analyses
To run CIBERSORT locally in R, navigate to the directory
containing the CIBERSORT.R script, and run the following
commands within the R terminal:
> source('CIBERSORT.R')
> results <- CIBERSORT('sig_matrix_file.txt', 'mixture_file.txt', perm=100, QN=TRUE)
Deconvolution output will be saved to a results object in R and
written to disk as CIBERSORT-results.txt in the same directory.
In this example, sig_matrix_file.txt should be “LM22.txt”
(obtain under Menu>Download); mixture_file.txt should be
“prostate_cancer.txt”; perm is an integer number for the number
of permutations; and QN is a Boolean value (TRUE or FALSE) for
performing quantile normalization. QN is set to TRUE by default
and recommended when the gene signature matrix is derived from
several different studies or sample batches.
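For readers unfamiliar with quantile normalization, the idea behind the QN option can be sketched as follows: every sample is forced onto a common reference distribution, defined here as the mean of the sorted values across samples. This Python sketch is a generic textbook version written for illustration; it is not the routine called by the CIBERSORT R script, and it breaks ties by position, which production implementations may handle differently.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length sample columns."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)
    ]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # gene indices by rank
        out = [0.0] * n
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized


# Two toy samples measured over three genes.
normalized_samples = quantile_normalize([[5, 2, 3], [4, 1, 9]])
```

After normalization both samples contain exactly the same set of values, and only the ranking of genes within each sample is preserved.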
3.2.4 Interpretation of Results
Once the online analysis is complete, the website will output a
stacked bar plot (see Fig. 3) and a heat map (see Fig. 4). The output
includes a p-value for the global deconvolution of each sample. A
p-value threshold of <0.05 is recommended. By default, deconvolution
results are expressed as relative fractions normalized to 1 (e.g.,
fractions of total leukocyte content). Researchers interested in studying
absolute levels of immune cells should refer to Subheading 3.5.
Fig. 3 Inferred composition of 22 immune cell subsets in malignant and normal prostate biopsies (related to
Subheading 3.2). The results were generated using CIBERSORT and the built-in LM22 immune cell gene
signature, and the stacked bar plot display was automatically generated by the CIBERSORT webserver
Fig. 4 Estimated proportions of six major leukocyte subsets (B cells, CD8 T cells, CD4 T cells, NK cells,
monocytes/macrophages, neutrophils) in skin cutaneous melanoma tumor biopsies profiled by The Cancer
Genome Atlas (TCGA). The results were determined using a custom RNA-Seq leukocyte signature matrix
(“LM6,” Subheading 3.3.3), and the heat map figure was generated by the CIBERSORT webserver
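When results are processed outside the website, the recommended p-value filter can be applied programmatically. The Python sketch below reads a tab-delimited results table and keeps samples with p < 0.05. The column names ("Mixture", "P-value", "Correlation", "RMSE") are assumptions about the layout of the results file and should be adjusted to match the actual header.

```python
import csv
import io


def significant_fractions(handle, alpha=0.05, sample_col="Mixture",
                          p_col="P-value", other_cols=("Correlation", "RMSE")):
    """Return {sample: {cell type: fraction}} for samples with p < alpha."""
    non_fraction = {sample_col, p_col, *other_cols}
    kept = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row[p_col]) < alpha:
            kept[row[sample_col]] = {
                name: float(value)
                for name, value in row.items()
                if name not in non_fraction
            }
    return kept


# Toy results table with one significant and one non-significant sample.
toy_results = io.StringIO(
    "Mixture\tB cells\tT cells\tP-value\tCorrelation\tRMSE\n"
    "S1\t0.25\t0.75\t0.01\t0.9\t0.5\n"
    "S2\t0.5\t0.5\t0.2\t0.1\t1.1\n"
)
kept = significant_fractions(toy_results)
```

In relative mode the retained fractions for each sample should sum to 1, which offers a quick sanity check on the parsed columns.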
3.3 TIL Characterization with a Custom Signature Matrix
A custom signature matrix can be created using data from purified
cell populations. While the process to generate a custom matrix
from expression profiles is straightforward, the performance of a
custom matrix will depend on the quality of the data used to
generate it.
3.3.1 Generation of Expression Profiles for Custom Gene Signature Matrix Creation
Immunophenotyping of leukocytes is a dynamic field, with new
immune populations continuing to be identified. Care
should be taken in determining which immune “cell types” should
be included in the signature matrix and which canonical markers
should be used to isolate these populations. For example, it is clear
that the population of “CD4-expressing T lymphocytes”
encompasses heterogeneous populations with diverse functional
phenotypes including naive, memory, Th1, Th2, Th17, T-regulatory
cells, and T follicular helper cells. Replicates for each purified
immune cell type are required to gauge variance in the expression
profile (see Note 4 for further details). The platform and methods
used to generate data for the signature matrix ideally should be
identical to those applied to the analysis of the mixture samples.
See Note 3 for analyzing murine data. While SVR is robust to
unknown cell populations, performance can be adversely affected
by genes that are highly expressed in a relevant unknown cell
population (e.g., in the malignant cells) but not expressed by any
immune component present in the signature matrix. A simple option
implemented in CIBERSORT to limit this effect is to remove genes
highly expressed in non-hematopoietic cells or tumor cells. If
expression data are available from purified tumor cells for the
malignancy to be studied, these can be used as a guideline to filter other
confounding genes from the signature matrix.
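One way to act on a purified tumor profile is to drop signature genes whose tumor expression exceeds what any immune subset shows. The Python sketch below illustrates that idea under stated assumptions: the comparison rule, the max_ratio parameter, and the function name are hypothetical choices for this example and do not describe the filter built into CIBERSORT.

```python
def drop_tumor_dominated_genes(signature, tumor_profile, max_ratio=1.0):
    """Keep genes whose tumor expression is at most max_ratio times the
    highest value observed across the immune subsets in the signature.

    signature: dict gene -> {cell type -> expression}.
    tumor_profile: dict gene -> expression in purified tumor cells.
    """
    kept = {}
    for gene, by_cell in signature.items():
        tumor_value = tumor_profile.get(gene, 0.0)  # absent genes count as 0
        if tumor_value <= max_ratio * max(by_cell.values()):
            kept[gene] = by_cell
    return kept


# Toy example: KRT5 is far higher in tumor cells than in any immune subset.
toy_signature = {
    "CD8A": {"T cells": 100.0, "B cells": 5.0},
    "KRT5": {"T cells": 20.0, "B cells": 30.0},
}
toy_tumor = {"CD8A": 2.0, "KRT5": 5000.0}
filtered = drop_tumor_dominated_genes(toy_signature, toy_tumor)
```

Any such filter should be followed by the quality control checks described for custom signature matrices, since removing genes changes the condition number of the matrix.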
3.3.2 Input Data Preparation
The mixture input data format for the custom signature matrix
option is identical to that used for analysis with the LM22 signature
matrix (Subheading 3.2.1). To generate the custom signature
matrix, the user needs to provide a reference sample file containing
the GEPs for each purified immune population of interest, and a
phenotype class file assigning the profiles to each phenotypic type of
immune cell to be included in the signature matrix. The expression
data in the reference sample file should be in non-log (i.e., linear)
space with genes listed in rows and reference populations listed
in columns. The phenotype class file lists the desired cell
populations of the signature matrix in rows and the purified
reference samples contained in the reference sample file in
columns (refer to the CIBERSORT website manual for more
details). The samples must be listed in exactly the same order as in
the reference sample file. The cells of the table are used to assign
phenotypic classes to each purified reference sample. Importantly,
all cell types should be represented by at least two replicates in
order to identify genes with significantly differential expression
(see Note 4).
Table 2
Format of input files to generate reference files and class files necessary for custom gene signatures
(tab-separated plain text)
Gene symbol (required)    Cell type Name1    Cell type Name1    Cell type Name1    Cell type Name2    Cell type Name2    Cell type Name2    ...
Gene1
Gene2
...
For ease of use, we have created an R script to generate both
intermediate files (the script is available from the CIBERSORT
website). Gene expression data for each purified sample should be
formatted similarly to the mixture input data (see Table 2) and each
replicate of the same cell type must be labeled with the identical
phenotypic class name. To run the script, execute the following
command:
Rscript generate_ref_and_class.R your_input_mixture_file.txt
The script will produce two output files, both of which are
required to build a signature matrix: class_file.input.txt (i.e.,
phenotype class file) and reference_file.input.txt (i.e., reference
sample file).
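The layout of a phenotype class file (classes in rows, reference samples in columns, in the same order as the reference sample file) can be sketched in Python as shown below. The numeric coding used here, 1 for samples that belong to the class and 2 for samples the class is compared against, is an assumption that should be verified against the CIBERSORT website manual; the function name is hypothetical and this is not the script distributed on the website.

```python
def build_phenotype_classes(sample_labels):
    """Return (class names, rows) for a phenotype class table.

    sample_labels: class name of each reference sample, in the same order
    as the columns of the reference sample file.
    Assumed coding: 1 = sample belongs to the class, 2 = sample is
    compared against the class.
    """
    classes = []
    for label in sample_labels:
        if label not in classes:  # preserve first-seen order
            classes.append(label)
    rows = [
        [1 if label == cls else 2 for label in sample_labels]
        for cls in classes
    ]
    return classes, rows


# Toy example: two replicates each of two cell types.
class_names, class_rows = build_phenotype_classes(
    ["B cells", "B cells", "CD8 T cells", "CD8 T cells"]
)
```

Each row has one entry per reference sample, so a mismatch between the number of columns here and in the reference sample file signals an ordering or labeling problem.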
3.3.3 Creating the Signature Matrix
In the following two sections, we describe how to create a custom
leukocyte signature matrix and apply it to study cellular
heterogeneity and TIL survival associations in melanoma tumors profiled by
The Cancer Genome Atlas (TCGA). Readers can follow along by
creating “LM6,” a leukocyte RNA-Seq signature matrix composed
of six peripheral blood immune subsets (B cells, CD8 T cells, CD4
T cells, NK cells, monocytes/macrophages, neutrophils;
GSE60424 [21]). Key input files are provided on the CIBERSORT
website (“Menu>Download”).
A custom signature file can be created by uploading the
Reference sample file and the Phenotype classes file (Subheading 3.3.2)
to the online CIBERSORT application (see Fig. 2) or can be created
using the downloadable Java package. To build a custom gene
signature matrix with the latter, the user should download the
Java package from the CIBERSORT website and place all relevant
files under the package folder. To link Java with R, run the
following in R:
Within R:
> library(Rserve)
> Rserve(args="--no-save")
Command line:
> java -Xmx3g -Xms3g -jar CIBERSORT.jar -M Mixture_file -P Reference_sample_file -c phenotype_class_file -f
The last argument (-f) will eliminate non-hematopoietic genes
from the signature matrix and is generally recommended for
signature matrices tailored to leukocyte deconvolution. The user can also
run this step on the website by choosing the corresponding
reference sample file and phenotype class file (see Fig. 2). The
CIBERSORT website will generate a gene signature matrix located under
“Uploaded Files” for future download.
Following signature matrix creation, quality control measures
should be taken to ensure robust performance (see “Calibration of
in silico TIL profiling methods” in Newman and Alizadeh [18]).
Factors that can adversely affect signature matrix performance
include poor input data quality, significant deviations in gene
expression between cell types that reside in different tissue
compartments (e.g., blood versus tissue), and cell populations with
statistically indistinguishable expression patterns. Manual filtering
of poorly performing genes in the signature matrix (e.g., genes
expressed highly in the tumor of interest) may improve performance.
To benchmark our custom leukocyte matrix (LM6), we
compared it to LM22 using a set of TCGA lung squamous cell
carcinoma tumors profiled by RNA-Seq and microarray (n = 130 pairs).
Deconvolution results were significantly correlated for all cell
subsets shared between the two signature matrices (p < 0.0001).
Notably, since LM6 was derived from leukocytes isolated from
peripheral blood [21, 22], we restricted the CD4 T cell comparison
to naive and resting memory CD4 T cells in LM22. Once validation
is complete, a CIBERSORT signature matrix can be broadly
applied to mixture samples as described in Subheading 3.2.3 (e.g.,
see Fig. 4).
3.4 Correlating TIL Levels with Clinical Outcomes
Associations with clinical indices and outcomes are commonly
assessed using a log-rank test for binary variables and Cox
proportional hazards regression for continuous variables. There are a
number of freely available tools for such analyses. We typically use
the R “survival” package or the Python “lifelines” package. To
illustrate TIL survival analysis in primary tumor samples, we applied
LM6 (Subheading 3.3.3) to 473 TCGA skin cutaneous melanoma
tumor samples profiled by RNA-Seq (see Fig. 4). We then analyzed
the influence of estimated CD8 T cell levels on overall survival.
Higher levels of CD8 T lymphocytes were associated with favorable
overall survival in both dichotomous (Fig. 5) and continuous
models (p = 0.013, Cox regression), consistent with previous studies
[1, 2].
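The dichotomous analysis can be sketched without external dependencies as a median split followed by a two-group log-rank test. The Python code below is a simplified illustration, not the analysis code used for Fig. 5; in practice the “survival” or “lifelines” packages named above would normally be used. The function names and the toy data are assumptions for this example.

```python
import math
import statistics


def median_split(values):
    """Label each value 'high' if above the median, otherwise 'low'."""
    cutoff = statistics.median(values)
    return ["high" if v > cutoff else "low" for v in values]


def logrank_test(times, events, groups):
    """Two-group log-rank test.

    events: 1 if the event (death) was observed, 0 if censored.
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("exactly two groups are required")
    first = labels[0]
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        n = len(at_risk)
        n1 = sum(1 for g in at_risk if g == first)
        deaths = [g for ti, e, g in zip(times, events, groups) if ti == t and e == 1]
        d = len(deaths)
        d1 = sum(1 for g in deaths if g == first)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Toy data: four patients, all with observed events.
toy_groups = median_split([0.1, 0.2, 0.6, 0.9])
toy_chi2, toy_p = logrank_test([1, 2, 3, 4], [1, 1, 1, 1], toy_groups)
```

The test only compares the two curves; estimating a hazard ratio or modeling CD8 T cell levels as a continuous variable requires Cox regression, which is beyond this sketch.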
[Figure 5 plot: survival curves, Cutaneous Melanoma (TCGA). x-axis: Time (months); y-axis: Overall survival. Groups: High CD8 T cells, Low CD8 T cells. Logrank P = 0.001; HR = 0.76 [0.64–0.9]; n = 364 tumors]
Fig. 5 Association between inferred tumor-infiltrating CD8 T cell content and
overall survival in patients with skin cutaneous melanoma profiled by TCGA
(related to Subheading 3.4). Estimated CD8 T cell levels were stratified by a
median split, and the separation between survival curves was evaluated using
a log-rank test. Only patients with available survival data and with a significant
CIBERSORT p-value (<0.05) were considered for this analysis (n = 364). HR,
hazard ratio. 95% confidence intervals for the hazard ratio are shown in brackets
3.5 Use of CIBERSORT to Infer Absolute TIL Levels
By default, CIBERSORT estimates the relative fraction of each cell
type in the signature matrix, such that the sum of all fractions is
equal to 1 for a given mixture sample. CIBERSORT can also be
used to produce a score that quantitatively measures the overall
abundance of each cell type (as described in “Analysis of
deconvolution consistency” in Newman et al. [17]). Briefly, the absolute
immune fraction score is estimated by the median expression level
of all genes in the signature matrix divided by the median
expression level of all genes in the mixture. Using this metric coupled with
LM22, we have found that CIBERSORT effectively captures
overall immune content in RNA-Seq and microarray datasets when
benchmarked against other methods. These include H&E staining
and computational inference by ESTIMATE [23], a previously
published method for determining overall immune content in
tumor expression profiles.
Absolute results can be easily accessed from the CIBERSORT
website by toggling the output between relative and absolute
modes in the Results page (see online manual for details). When
using the R script (Subheading 3.2.3), the user should download
the latest version of the script and set “absolute=TRUE.” For
example:
> results <- CIBERSORT('sig_matrix_file.txt', 'mixture_file.txt', perm=100, absolute=TRUE)
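The ratio of medians described in this subheading can be written out in a few lines of Python. This sketch makes two assumptions that go beyond the text: the numerator is taken over the signature genes as measured in the mixture, and the relative fractions are multiplied by the resulting score to obtain absolute values. Function names are hypothetical, and the CIBERSORT script should be consulted for the exact procedure.

```python
import statistics


def absolute_score(mixture, signature_genes):
    """Median expression of signature genes divided by the median of all genes.

    mixture: dict gene -> expression for one mixture sample (linear scale).
    Signature genes absent from the mixture are ignored (assumption).
    """
    signature_values = [mixture[g] for g in signature_genes if g in mixture]
    return statistics.median(signature_values) / statistics.median(mixture.values())


def to_absolute(relative_fractions, score):
    """Scale relative fractions by the score (assumed scaling step)."""
    return {cell: fraction * score for cell, fraction in relative_fractions.items()}


# Toy mixture with five genes, two of which belong to the signature.
toy_mixture = {"CD8A": 40, "MS4A1": 20, "ACTB": 100, "GAPDH": 200, "ALB": 10}
toy_score = absolute_score(toy_mixture, ["CD8A", "MS4A1", "CD19"])
toy_absolute = to_absolute({"T cells": 0.5, "B cells": 0.5}, toy_score)
```

Unlike relative fractions, the scaled values no longer sum to 1 within a sample, which is what allows overall immune content to be compared across samples.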
3.6 Conclusion
CIBERSORT is an in silico approach for characterizing cell subsets
of interest in high-dimensional genomic data derived from bulk
tissue samples. Given a validated signature matrix, CIBERSORT
can profile compositional differences in a standardized manner,
facilitating robust and reproducible analyses of cellular
heterogeneity in newly measured and archived genomic datasets,
fresh/frozen tissue biopsies, and fixed clinical specimens. Since
CIBERSORT is platform agnostic, it can be applied to diverse genomic
data types other than mRNA, including DNA methylation,
microRNA, proteomic, and chromatin accessibility profiles.
CIBERSORT is therefore a versatile framework for tissue
characterization, with applications for identifying predictive and
prognostic cellular biomarkers, and novel therapeutic targets.
4 Notes
1. CIBERSORT is platform agnostic and can be applied to any
genomic admixture that satisfies its mathematical model
(Subheading 1.1), including mixtures profiled by RNA-Seq.
Although LM22 was derived and originally validated using
microarray data, we have observed significant correlations for
most LM22 populations on paired microarray/RNA-Seq
TCGA datasets, suggesting that it is reasonable to apply LM22
to RNA-Seq data for hypothesis generation. Nevertheless, if
significant subsets of genes within LM22 are not present in the
RNA-Seq summarization, the deconvolution of the
corresponding cell types may be adversely affected. To avoid
such potential degradation of deconvolution, we strongly
recommend retaining as many LM22 genes as possible in the
mixture file (e.g., components of BCR and TCR genes). Separately,
it has been noted that RNA-Seq mixture samples analyzed with the
LM22 matrix will have a higher frequency of samples with
p-values above 0.05. This is largely due to the differing dynamic
range of RNA-Seq and microarray data, and may not accurately
reflect the quality of the deconvolution results. Users should
therefore exercise caution in interpreting cross-platform
p-values. An RNA-Seq derived signature matrix analogous to
LM22 is currently being developed by the authors with an
expanded set of immune populations.
2. HUGO gene symbols are required as input when the LM22
signature matrix is used. However, CIBERSORT is not
restricted to HUGO gene symbols, and users working with
custom gene signatures can employ any set of unique
alphanumeric identifiers, provided they are consistent between the
signature matrix and the mixture file. When a user is not using
HUGO gene symbols, the non-hematopoietic gene filtering
functions will not work since these lists are represented in
HUGO format.
3. Applying the LM22 matrix to a murine tumor may be unreliable
due to cross-species differences in immune biology. A user
working with murine data should consider building a custom
signature matrix with either publicly available data (e.g.,
ImmGen; https://www.immgen.org/) or in-house data.
4. The CIBERSORT model builds a gene signature matrix by
minimizing gene expression variance within the same cell type
and by maximizing variance between cell types; it is therefore
important to use data replicates. Cell types should be isolated
from the same tissue type or culture conditions, and biological
replicates are recommended to help the model capture
donor-to-donor variations. To increase statistical power, we
recommend using three or more replicates for each cell subset.
Acknowledgments
We would like to thank David Steiner, M.D., Ph.D. for his
assistance in generating the RNA-Seq derived signature matrix. This
work is supported by grants from the Doris Duke Charitable
Foundation (A.A.A.), the Damon Runyon Cancer Research Foundation
(A.A.A.), the B&J Cardan Oncology Research Fund (A.A.A.), the
Ludwig Institute for Cancer Research (A.A.A.), NIH grant
1K99CA187192-01A1 (A.M.N.), NIH grant PHS NRSA 5T32
CA09302-35 (A.M.N.), US Department of Defense grant
W81XWH-12-1-0498 (A.M.N.), a grant from the Siebel Stem
Cell Institute and the Thomas and Stacey Siebel Foundation
(A.M.N.), an NIH/Stanford MSTP training grant (B.C.), and a
PD Soros Fellowship (B.C.).
References
1. Fridman WH, Pagès F, Sautès-Fridman C, Galon J (2012) The immune contexture in human tumours: impact on clinical outcome. Nat Rev Cancer 12(4):298–306. https://doi.org/10.1038/nrc3245
2. Gentles AJ, Newman AM, Liu CL, Bratman SV, Feng W, Kim D, Nair VS, Xu Y, Khuong A, Hoang CD, Diehn M, West RB, Plevritis SK, Alizadeh AA (2015) The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat Med 21(8):938–945. https://doi.org/10.1038/nm.3909
3. Tumeh PC, Harview CL, Yearley JH, Shintaku IP, Taylor EJ, Robert L, Chmielowski B, Spasic M, Henry G, Ciobanu V, West AN, Carmona M, Kivork C, Seja E, Cherry G, Gutierrez AJ, Grogan TR, Mateus C, Tomasic G, Glaspy JA, Emerson RO, Robins H, Pierce RH, Elashoff DA, Robert C, Ribas A (2014) PD-1 blockade induces responses by inhibiting adaptive immune resistance. Nature 515(7528):568–571. https://doi.org/10.1038/nature13954
4. Herbst RS, Soria JC, Kowanetz M, Fine GD, Hamid O, Gordon MS, Sosman JA, McDermott DF, Powderly JD, Gettinger SN, Kohrt HE, Horn L, Lawrence DP, Rost S, Leabman M, Xiao Y, Mokatrin A, Koeppen H, Hegde PS, Mellman I, Chen DS, Hodi FS (2014) Predictive correlates of response to the anti-PD-L1 antibody MPDL3280A in cancer patients. Nature 515(7528):563–567. https://doi.org/10.1038/nature14011
5. Ji RR, Chasalow SD, Wang L, Hamid O, Schmidt H, Cogswell J, Alaparthy S, Berman D, Jure-Kunkel M, Siemers NO, Jackson JR, Shahabi V (2012) An immune-active tumor microenvironment favors clinical response to ipilimumab. Cancer Immunol Immunother 61(7):1019–1031. https://doi.org/10.1007/s00262-011-1172-6
6. Tung JW, Heydari K, Tirouvanziam R, Parks DR, Herzenberg LA, Herzenberg LA (2007) Modern flow cytometry: a practical approach. Clin Lab Med 27(3):453. https://doi.org/10.1016/j.cll.2007.05.001
7. Abbas AR, Wolslegel K, Seshasayee D, Modrusan Z, Clark HF (2009) Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS One 4(7):e6098. https://doi.org/10.1371/journal.pone.0006098
8. Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, Gräf S, Ha G, Haffari G, Bashashati A, Russell R, McKinney S, Langerød A, Green A, Provenzano E, Wishart G, Pinder S, Watson P, Markowetz F, Murphy L, Ellis I, Purushotham A, Børresen-Dale AL, Brenton JD, Tavaré S, Caldas C, Aparicio S, Group M (2012) The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486(7403):346–352. https://doi.org/10.1038/nature10983
9. Ascierto ML, Kmieciak M, Idowu MO, Manjili R, Zhao Y, Grimes M, Dumur C, Wang E, Ramakrishnan V, Wang XY, Bear HD, Marincola FM, Manjili MH (2012) A signature of immune function genes associated with recurrence-free survival in breast cancer patients. Breast Cancer Res Treat 131(3):871–880. https://doi.org/10.1007/s10549-011-1470-x
10. Mann GJ, Pupo GM, Campain AE, Carter CD, Schramm SJ, Pianova S, Gerega SK, De Silva C, Lai K, Wilmott JS, Synnott M, Hersey P, Kefford RF, Thompson JF, Yang YH, Scolyer RA (2013) BRAF mutation, NRAS mutation, and the absence of an immune-related expressed gene profile predict poor outcome in patients with stage III melanoma. J Invest Dermatol 133(2):509–517. https://doi.org/10.1038/jid.2012.283
11. Galon J, Angell HK, Bedognetti D, Marincola FM (2013) The continuum of cancer immunosurveillance: prognostic, predictive, and mechanistic signatures. Immunity 39(1):11–26. https://doi.org/10.1016/j.immuni.2013.07.008
12. Galon J, Costes A, Sanchez-Cabo F, Kirilovsky A, Mlecnik B, Lagorce-Pagès C, Tosolini M, Camus M, Berger A, Wind P, Zinzindohoué F, Bruneval P, Cugnenc PH, Trajanoski Z, Fridman WH, Pagès F (2006) Type, density, and location of immune cells within human colorectal tumors predict clinical outcome. Science 313(5795):1960–1964. https://doi.org/10.1126/science.1129139
13. Tosolini M, Kirilovsky A, Mlecnik B, Fredriksen T, Mauger S, Bindea G, Berger A, Bruneval P, Fridman WH, Pagès F, Galon J (2011) Clinical impact of different classes of infiltrating T cytotoxic and helper cells (Th1, th2, treg, th17) in patients with colorectal cancer. Cancer Res 71(4):1263–1271. https://doi.org/10.1158/0008-5472.can-10-2907
14. Verhaak RG, Tamayo P, Yang JY, Hubbard D, Zhang H, Creighton CJ, Fereday S, Lawrence M, Carter SL, Mermel CH, Kostic AD, Etemadmoghadam D, Saksena G, Cibulskis K, Duraisamy S, Levanon K, Sougnez C, Tsherniak A, Gomez S, Onofrio R, Gabriel S, Chin L, Zhang N, Spellman PT, Zhang Y, Akbani R, Hoadley KA, Kahn A, Köbel M, Huntsman D, Soslow RA, Defazio A, Birrer MJ, Gray JW, Weinstein JN, Bowtell DD, Drapkin R, Mesirov JP, Getz G, Levine DA, Meyerson M, Network CGAR (2013) Prognostically relevant gene signatures of high-grade serous ovarian carcinoma. J Clin Invest 123(1):517–525. https://doi.org/10.1172/jci65833
15. Liebner DA, Huang K, Parvin JD (2014) MMAD: microarray microdissection with analysis of differences is a computational tool for deconvoluting cell type-specific contributions from tissue samples. Bioinformatics 30(5):682–689. https://doi.org/10.1093/bioinformatics/btt566
16. Zhong Y, Wan YW, Pang K, Chow LM, Liu Z (2013) Digital sorting of complex tissues for cell type-specific gene expression profiles. BMC Bioinformatics 14:89. https://doi.org/10.1186/1471-2105-14-89
17. Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, Hoang CD, Diehn M, Alizadeh AA (2015) Robust enumeration of cell subsets from tissue expression profiles. Nat Methods 12(5):453–457. https://doi.org/10.1038/nmeth.3337
18. Newman AM, Alizadeh AA (2016) High-throughput genomic profiling of tumor-infiltrating leukocytes. Curr Opin Immunol 41:77–84. https://doi.org/10.1016/j.coi.2016.06.006
19. Scholkopf B, Smola AJ, Williamson RC, Bartlett PL (2000) New support vector algorithms. Neural Comput 12(5):1207–1245
20. Corces MR, Buenrostro JD, Wu BJ, Greenside PG, Chan SM, Koenig JL, Snyder MP, Pritchard JK, Kundaje A, Greenleaf WJ, Majeti R, Chang HY (2016) Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat Genet 48(10):1193–1203. https://doi.org/10.1038/ng.3646
21. Linsley PS, Speake C, Whalen E, Chaussabel D (2014) Copy number loss of the interferon gene cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis. PLoS One 9(10):e109760. https://doi.org/10.1371/journal.pone.0109760
22. Sleasman JW, Leon BH, Aleixo LF, Rojas M, Goodenow MM (1997) Immunomagnetic selection of purified monocyte and lymphocyte populations from peripheral blood mononuclear cells following cryopreservation. Clin Diagn Lab Immunol 4(6):653–658
23. Yoshihara K, Shahmoradgoli M, Martinez E, Vegesna R, Kim H, Torres-Garcia W, Trevino V, Shen H, Laird PW, Levine DA, Carter SL, Getz G, Stemke-Hale K, Mills GB, Verhaak RGW (2013) Inferring tumour purity and stromal and immune cell admixture from expression data. Nat Commun 4:2612. https://doi.org/10.1038/ncomms3612
Chapter 13
Systems Biology Approaches in Cancer Pathology

Aaron DeWard and Rebecca J. Critchley-Thorne
Abstract
The complex network of the tissue system, in both pre-neoplastic tissues and tumors, demonstrates the
need for a systems biology approach to cancer pathology, in which quantification of key tissue system
processes is combined with informatics tools to produce actionable scores to aid clinical decision-making. A
systems biology approach to cancer pathology enables integration of key system features that are relevant to
diagnoses, patient outcomes, and responses to therapies. Key tissue system features relevant to cancer
pathology include molecular and morphologic abnormalities in epithelia, cellular changes in the stroma
such as immune infiltrates, and relationships between components of the system, such as interactions and
spatial relationships between epithelial and stromal components, and also between specific immune cell
subsets. Here, we describe a method for objective quantification of multiple epithelial and stromal bio-
markers in the context of tissue architecture to generate a high dimensional tissue profile that can be used to
build multivariable predictive models for cancer pathology.
Key words Biomarkers, Multiplexed immunofluorescence, Whole slide fluorescence imaging, Digital
pathology, Quantitative image analysis, Cancer systems biology
1 Introduction
Current pathology methods for the assessment of pre-neoplastic
and neoplastic tissues have been valuable for many decades, but are
limited by subjectivity and observer variability. Digital slide scan-
ning and algorithms for automated scoring of biomarkers stained
by immunohistochemistry are gaining traction in clinical labora-
tories, which will improve workflows and reduce variability
[1, 2]. However, the majority of current biomarkers used in cancer
pathology testing are markers of epithelial cell processes and
abnormalities. The complexity of the tissue system and the impor-
tant roles of stromal components in the development and progres-
sion of cancer, and in the responses of cancer to therapies, highlight
the need for a systems biology approach to cancer pathology
[3, 4]. Assessment of key tissue system biomarkers can improve
on the current diagnostic tools by creating multivariate profiles that
capture key molecular and cellular features of the tissue
environment, including relationships between biomarkers
[3, 5]. Objective, reproducible measurement of multiple biomar-
kers per slide can be achieved via automated multiplexed immuno-
fluorescence labeling of four or more biomarkers per slide coupled
with standardized whole slide fluorescence scanning. Quantitative
image analysis can be used to automatically extract an array of
quantitative biomarker and morphologic features from whole
slide tissue images, resulting in a multivariate tissue profile. Such
profiles can be mined in samples from patient cohorts with
clinicopathologic data to identify clinically relevant signatures, and to
build diagnostic, prognostic, and predictive models. Here we
describe methods for automated immunofluorescent labeling of
multiple biomarkers in tissue slides, and standardized whole slide
fluorescence imaging to produce tissue images with multiple
registered channels of biomarker signal and morphology data.
These composite images can be analyzed by tissue image analysis
platforms to generate quantitative data on key tissue system features
and processes. A systems biology approach to assess multiple epi-
thelial and stromal biomarkers in Barrett’s esophagus biopsies is
discussed as an application of the method.
2 Materials
2.1 Multiplexed Immunofluorescence Slide Labeling Procedure
1. Slides prepared with 5 μm sections of formalin-fixed paraffin-embedded (FFPE) tissue (see Note 1).
2. BondRX autostainer (Leica BioSystems).
3. Bond Dewax Solution (Leica BioSystems).
4. 100% ethanol (reagent grade).
5. Bond Epitope Retrieval (ER) Solution 2 (Leica BioSystems).
6. Bond Wash Solution 10× (Leica BioSystems).
7. Image-iT® FX Signal Enhancer (Thermo Fisher).
8. Blocking buffer: Tris-buffered saline, 5% goat serum, 1% glyc-
erol, 0.1% bovine serum albumin, 0.1% cold water fish skin
gelatin, 0.04% sodium azide (see Note 2).
9. Primary antibody cocktails: primary antibodies in blocking
buffer at a dilution predetermined by titration to produce
optimal signal:noise (see Notes 3 and 4).
10. Secondary antibody cocktails: Alexa Fluor®-conjugated sec-
ondary antibodies raised in goat, specific to the species and
species isotypes of the primary antibody cocktail (see Notes 5
and 6).
11. Hoechst 33342 or equivalent label that emits blue fluorescence
when bound to double-stranded DNA.
12. Deionized water.
13. Prolong Gold Antifade Mountant (Thermo Fisher), or similar
aqueous mounting medium containing components to protect
against photobleaching.
14. Glass coverslips (#1.5).
15. Lens cleaner.
16. Clear nail polish.
2.2 Whole Slide Fluorescence Scanning
1. ScanScope® FL (Leica BioSystems) equipped with:
(a) BrightLine® Pinkel quadband filter set optimized for
DAPI, FITC, TRITC, & Cy5 (FF01-440/521/607/700-25).
(b) BrightLine® single-band bandpass excitation filters FF01-
387/11-25, FF01-485/20-25, FF01-560/25-25 and
FF01-650/13-25 (Semrock).
(c) X-Cite® exacte light source (Lumen Dynamics/Excelitas
Technologies Corp.).
(d) Light source calibration device (see Note 7) (Lumen
Dynamics/Excelitas Technologies Corp.).
2. TetraSpeck Fluorescent Microspheres.
3. FocalCheck Fluorescent Microsphere.
4. ImageScope software (Leica BioSystems).
2.3 Quantitative Image Analysis
1. TissueCypher® Image Analysis Platform (Cernostics).
2. Image Processing Toolbox™ (MathWorks [6]).
3. ImageScope software (Leica BioSystems).
3 Methods
3.1 Multiplexed Immunofluorescence Slide Labeling Procedure
Program the BondRX autostainer to perform the following steps
(application volume is 150 μL for all the reagent steps):
1. Bake slides (no reagent), incubation time 30 min, 60 °C.
2. Apply Bond Dewax solution, incubation time 30 min, 72 °C.
3. Reapply Bond Dewax solution, incubation time 0 s, 72 °C.
4. Reapply Bond Dewax solution, incubation time 0 s, ambient
temperature (see Note 8).
5. Apply ethanol, incubation time 0 s, ambient temperature,
repeat twice for total of three ethanol washes.
6. Apply Bond Wash (diluted to 1× with deionized water), incubation
time 0 s, ambient temperature, repeat for total of three
washes with Bond Wash.
7. Apply Bond ER Solution 2, incubation time 0 s, ambient
temperature, repeat once.
8. Apply Bond ER Solution 2, incubation time 30 min,
98–100 °C (see Note 9).
9. Apply Bond ER Solution 2, incubation time 0 s, ambient
temperature.
10. Apply Bond Wash, incubation time 0 s, ambient temperature,
repeat for total of four washes with Bond Wash.
11. Apply Bond Wash, incubation time 0 s, ambient temperature.
12. Apply Image-iT FX, incubation time 30 min, ambient
temperature.
13. Apply Bond Wash, repeat for total of three washes with Bond Wash.
14. Apply blocking buffer, incubation time 30 min, ambient
temperature.
15. Apply Bond Wash, repeat for a total of three washes.
16. Apply primary antibody cocktail, incubation time 60 min,
ambient temperature.
17. Apply Bond Wash, incubation time 0 s–1 min, ambient tem-
perature, repeat for total of 3–10 washes with Bond Wash (see
Note 10).
18. Apply secondary antibody cocktail, incubation time 60 min,
ambient temperature.
19. Apply Bond Wash, incubation time 0 s–1 min, ambient tem-
perature, repeat for total of 3–10 washes with Bond Wash (see
Note 10).
20. Apply Hoechst 33342, incubation time 3 min, ambient
temperature.
21. Apply deionized water, incubation time 0 s, ambient tempera-
ture, repeat for a total of three washes.
22. Remove the slides immediately, wipe excess water from slides,
allow sections to air dry at ambient temperature protected from
light. Once dry, mount the slides with aqueous mounting
medium and allow to cure for at least 12 h at room temperature
protected from light (see Note 11).
3.2 Whole Slide Fluorescence Scanning
1. Calibrate the light source to steady absolute output using the
X-Cite® XR2100 Power Meter. An output of 2.2 W will ensure
adequate illumination for most imaging applications that can
be maintained for 1500–2500 h of scanning depending on the
initial attainable wattage of the bulb.
2. Follow the manufacturer’s procedure for producing whole
slide scans with four registered channels of image data,
including:
(a) Image Hoechst (or equivalent) in the FF01-387/11-25
(or equivalent) channel, Alexa Fluor 488 in the FF01-
485/20-25 channel, Alexa Fluor 555 in the FF01-560/
25-25 channel and Alexa Fluor 647 in the FF01-650/13-
25 channel (see Note 12). The example images of the p16,
AMACR, p53 biomarker panel and HIF-1alpha,
CD45RO, CD1a biomarker panel in Barrett’s esophagus
biopsies are shown in Fig. 1.
(b) Optimize exposure times on a test set of known negative,
intermediate and high controls for each biomarker. Main-
tain consistent exposure times for all channels on all
patient samples (see Note 11).
(c) Review images for quality including focus, even illumina-
tion across imaging stripes/seams, artifacts, etc., and
rescan if necessary to produce high-quality tissue images
for quantitative image analysis.
(d) Verify channel registration periodically using slides mounted
with FocalCheck Fluorescent Microspheres (see Note 13).
(e) Verify scanner precision periodically, after bulb calibra-
tion, and after the replacement of the bulb or light
guide, using slides mounted with TetraSpeck Fluorescent
Microspheres (see Note 14).
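The pairing of biomarkers, antibody isotypes, fluorophores, and excitation filters described in step 2(a) and in Notes 4 and 6 can be written down as a small lookup table and checked before a staining run. The following Python sketch is only an illustration: the dictionary layout and the function names check_panel and channel_map are assumptions made here, not part of any vendor software.

```python
from collections import Counter

# Fluorophore -> single-band excitation filter, as listed in step 2(a).
FLUOR_TO_FILTER = {
    "Hoechst 33342": "FF01-387/11-25",
    "Alexa Fluor 488": "FF01-485/20-25",
    "Alexa Fluor 555": "FF01-560/25-25",
    "Alexa Fluor 647": "FF01-650/13-25",
}

# (biomarker, primary antibody host/isotype, secondary fluorophore), per Notes 4 and 6.
PANELS = {
    "p16/AMACR/p53": [
        ("p16", "mouse IgG2a", "Alexa Fluor 488"),
        ("AMACR", "rabbit IgG", "Alexa Fluor 555"),
        ("p53", "mouse IgG2b", "Alexa Fluor 647"),
    ],
    "HIF-1alpha/CD45RO/CD1a": [
        ("HIF-1alpha", "rabbit IgG", "Alexa Fluor 488"),
        ("CD45RO", "mouse IgG2a", "Alexa Fluor 555"),
        ("CD1a", "mouse IgG1", "Alexa Fluor 647"),
    ],
}


def check_panel(panel):
    """Return a list of problems; an empty list means the panel is consistent."""
    problems = []
    isotypes = Counter(isotype for _, isotype, _ in panel)
    fluors = Counter(fluor for _, _, fluor in panel)
    # Each primary must be distinguishable by species or isotype (Note 4).
    problems += [f"isotype used twice: {i}" for i, n in isotypes.items() if n > 1]
    # Each secondary must occupy its own fluorescence channel.
    problems += [f"fluorophore used twice: {f}" for f, n in fluors.items() if n > 1]
    problems += [f"no filter for: {f}" for f in fluors if f not in FLUOR_TO_FILTER]
    return problems


def channel_map(panel):
    """Map each excitation filter to the label imaged in it, including the nuclear stain."""
    channels = {FLUOR_TO_FILTER["Hoechst 33342"]: "Hoechst 33342"}
    for marker, _, fluor in panel:
        channels[FLUOR_TO_FILTER[fluor]] = marker
    return channels
```

Calling check_panel on either panel above returns an empty list, whereas a hypothetical cocktail containing two mouse IgG1 primaries would be flagged, because a single anti-mouse IgG1 secondary could not tell them apart.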
3.3 Quantitative Image Analysis
We utilize the TissueCypher® Image Analysis Platform, which
includes a high performance file reading mechanism based on
BigTiff format to decode raw image data, and MATLAB algorithms for
segmenting low level tissue objects such as nuclei, cytoplasm,
plasma membrane, and whole cells to allow feature collection at
the cellular and subcellular level. It further contains higher order
computer vision models for spatial quantification of biomarkers in
tissue compartments, such as epithelium and lamina propria, as
described by Prichard et al. [4]. There are multiple commercially
available tools for quantitative analysis of digital tissue slide images,
such as the Image Processing Toolbox™ that provides algorithms
and functions for image processing, image analysis and develop-
ment of algorithms for application-specific features.
Image analysis to create a multivariable tissue systems profile
should include:
1. Handling of image artifacts. Artifacts such as bubbles, folds,
fibers, and out of focus regions can be removed via manual
annotation of images or algorithms [7, 8] (see Note 15).
Fig. 1 Representative images of multiplexed panels of tissue system biomarkers in Barrett’s esophagus pinch
biopsies. Sections of Barrett’s esophagus pinch biopsies were fluorescently immunolabeled for the
multiplexed panels of biomarkers described in Notes 4 and 6. Whole slide images were acquired at 20×
magnification using the ScanScope FL. (Panels a–d) (a) HIF-1α-green (b) CD45RO-red, (c) CD1a-yellow,
(d) HIF-1α-green, CD1a-yellow overlay demonstrating infiltration of the lamina propria by cells expressing
HIF-1α, which indicates stromal angiogenesis, and also memory lymphocytes and dendritic cells. (Panels
e–h) (e) HIF-1α-green (f) CD45RO-red, (g) CD1a-yellow, (h) HIF-1α-green, CD45RO-red, CD1a-yellow overlay,
providing an additional example of infiltration of the lamina propria by cells expressing HIF-1α, memory
lymphocytes and dendritic cells. (Panels i–l) (i) p16-green, (j) AMACR-red, (k) p53-yellow, (l) p16-green,
AMACR-red, p53-yellow overlay showing loss of p16, focal overexpression of AMACR and overexpression of
p53. (Panels m–p) (m) p16-green, (n) AMACR-red, (o) p53-yellow, (p) p16-green, AMACR-red, p53-yellow
overlay showing normal/positive expression of p16, multi-focal overexpression of AMACR and loss of p53.
Hoechst shown in blue in all panels
Fig. 2 Cellular object segmentation and tissue structure segmentation to enable quantitative, contextual
feature measurements. The TissueCypher® Image Analysis Platform was used to detect a Barrett’s esophagus
biopsy and segment subcellular compartments and tissue objects. (a) Barrett’s esophagus biopsy labeled for
p16 (green), AMACR (red), p53 (yellow), and Hoechst (blue). (b) Segmentation of nuclei objects based on the
Hoechst channel. (c) Segmentation of cell objects containing nuclei by first creating a distance map to which
the watershed operation was applied, and then performing connected components labeling, as previously
described [4]. (d) Segmentation of cytoplasm by subtracting the nuclei mask shown in Panel b from the cell
mask shown in Panel c. (e) A nuclei cluster mask was produced via Gaussian smoothing of the Hoechst signal,
rank order filter, image thresholding, morphological operations, and connected components labeling, as
previously described [4]. (f) p53 signal (yellow) was measured within the segmented nuclei clusters
2. Segmentation of low level objects such as nuclei (based on
Hoechst signal), cytoplasm, plasma membrane, and whole
cells. Examples of object segmentation masks are shown in
Fig. 2. Object segmentation allows collection of quantitative
biomarker feature data at the cellular and subcellular levels,
which in turn allows calculation of basic intensity measure-
ments on biomarkers (mean, sum, standard deviation,
moment, etc.), co-expression of multiple biomarkers, ratios of
biomarkers between subcellular compartments, gating on sub-
populations of cells with overexpression/lack of expression of
multiple biomarkers, spatial arrangements of cells expressing
1 or more biomarkers, texture, nuclear morphology, etc. [4]
(see Note 16). Examples of cell object-based features extracted
from the biomarker panels described in this method include
p53 mean intensity in nuclei objects, and nuclear area in cell
objects with p16-loss and p53-overexpression. Both features
have been shown to have diagnostic significance in
distinguishing between Barrett’s esophagus biopsies with high
grade dysplasia versus non-dysplastic reactive atypia. This separation
has prognostic significance in predicting risk of future
progression in patients with Barrett’s esophagus [4, 5]. Addi-
tional examples include CD45RO sum intensity in plasma
membrane structures, using two-dimensional anisotropic dif-
fusion, histogram equalization, and conversion to binary using
the CD45RO signal [4], which has prognostic significance in
Barrett’s esophagus [5].
3. Computer vision models for segmentation of tissue struc-
tures and components, such as epithelium, lamina propria,
tumor nests, etc. Computer vision models allow localization of
biomarker signals to specific compartments and collection of
feature data in the context of tissue architecture [4]. Examples
of computer vision model/tissue structure-based features with
diagnostic and/or prognostic significance in Barrett’s esophagus
include mean intensity of p53 in nuclei clusters. A nuclei cluster
mask can be developed using the Hoechst (or equivalent dye)
signal as we have previously described in detail [4]. An example
nuclei cluster mask is shown in Fig. 2. Features derived from cell-
based objects can also be localized to rectangular regions of
tissue images to create microenvironment-based features that
capture localized or focal biomarker abnormalities. Such features
collected across whole slides can be summarized to quantify the
cell-object biomarker features in, for example, the top scoring
5% of microenvironments on each slide. The size of the rectan-
gular regions should be optimized to the specific application; we
used regions of 161 × 161 pixels to capture focal overexpression
of AMACR, and to detect clusters of stromal cells expressing
HIF-1alpha in Barrett’s esophagus biopsies, both of which have
diagnostic and prognostic significance [4, 5].
4. Statistical analysis of image-derived features. Image analysis
as described above generates multiple measurements per bio-
marker and when applied to multiplexed panels of biomarkers
will generate a high dimensional feature data set. When per-
formed on samples from an appropriately designed patient
cohort with corresponding clinicopathologic data, the high
dimensional data can be mined with the aid of bioinformatics
to identify quantitative features relevant to diagnosis, progno-
sis, and responses to therapies. Combinations of relevant fea-
tures can be used to build multivariable diagnostic, prognostic,
and predictive models that integrate data on key tissue system
processes to produce clinically actionable information. We have
previously described an application of this approach in detail
[5], in which 13,538 quantitative image analysis features
extracted from 14 candidate protein-based biomarkers and
Hoechst were mined in a training cohort of Barrett’s esopha-
gus patients with clinical outcome data in order to identify
prognostic features. A risk prediction model was built that
integrated 15 of the prognostic features, which were derived
from nine of the candidate protein biomarkers and Hoechst,
into an individualized risk score that is correlated with risk of
future progression to high grade dysplasia or esophageal ade-
nocarcinoma. The pre-specified risk prediction model was vali-
dated on an independent, multi-institutional cohort of patients
with Barrett’s esophagus, demonstrating significant risk strati-
fication of patients who progressed and patients who did not
progress to high-grade dysplasia or EAC, and showing prog-
nostic power that was independent of current clinical variables,
including pathologic diagnosis provided by a gastrointestinal
subspecialist [5].
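The object-based measurements in item 2 (for example, p53 mean intensity in nuclei objects) rest on two operations: labeling objects in the Hoechst channel and summarizing a biomarker channel within each labeled object. The Python sketch below shows those two operations on toy data. It is a deliberately simplified stand-in that uses a fixed threshold and 4-connected component labeling; it is not the TissueCypher® algorithm, which also applies smoothing, distance maps, and watershed operations [4]. The function names and the toy arrays are hypothetical.

```python
from collections import deque


def label_objects(image, threshold):
    """Label 4-connected regions of pixels whose value exceeds `threshold`.

    Returns a grid of the same shape holding 0 for background and 1..n for objects.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] <= threshold or labels[r][c]:
                continue
            # New object found: flood-fill it breadth-first.
            current += 1
            labels[r][c] = current
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] > threshold and not labels[ny][nx]):
                        labels[ny][nx] = current
                        queue.append((ny, nx))
    return labels


def mean_intensity_per_object(labels, marker):
    """Mean marker intensity inside each labeled object, keyed by object label."""
    totals, counts = {}, {}
    for label_row, marker_row in zip(labels, marker):
        for label, value in zip(label_row, marker_row):
            if label:
                totals[label] = totals.get(label, 0) + value
                counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


# Toy 3 x 5 images: a nuclear stain channel and a biomarker channel.
hoechst = [
    [0, 9, 9, 0, 0],
    [0, 9, 0, 0, 8],
    [0, 0, 0, 0, 8],
]
p53 = [
    [1, 4, 6, 1, 1],
    [1, 5, 1, 1, 20],
    [1, 1, 1, 1, 30],
]
nuclei = label_objects(hoechst, threshold=5)
print(mean_intensity_per_object(nuclei, p53))  # {1: 5.0, 2: 25.0}
```

On whole slide images the same logic would normally be run with an array library rather than nested lists; the plain-Python form is used here only to keep the sketch self-contained.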
Sample workflow to generate and apply quantitative feature
data from AMACR, one of several representative biomarkers used
for risk stratification in Barrett’s Esophagus:
1. Open ImageScope software to view the digital image of a slide
containing fluorescently labeled AMACR. Author annotations
to remove artifacts and/or select regions of interest using the
pen tool available within the software. For example, dust fibers
brightly autofluoresce, and have the potential to be interpreted
as positive signal. Annotating out a dust fiber will prevent its
incorporation into subsequent image analysis.
2. Focal overexpression of AMACR in Barrett’s Esophagus tissue
is correlated with an increased risk of disease progression,
whereas no/low expression is associated with a low risk of
progression. An example image containing focal AMACR
expression is shown in Fig. 1. The TissueCypher® Image Anal-
ysis Platform reads the annotated digital slide image, detects
tissue fragments, and segments cell-based objects (Fig. 2). To
quantify focal AMACR expression the software separates the
whole image into 161 × 161 pixel tiles/microenvironments.
The TissueCypher® Image Analysis software quantifies the
fluorescence intensity of AMACR in the cell-based objects,
e.g., plasma membrane objects, in each tile. The fluorescence
intensity can be quantified as mean, sum, standard deviation,
nth percentile, etc. The top five tiles, or top 5% of tiles based on
highest AMACR intensity, are averaged to generate a feature
value for AMACR.
3. The feature derived from AMACR can be evaluated in a patient
cohort with corresponding clinicopathologic data, e.g., a
cohort including patients whose Barrett’s esophagus pro-
gressed to esophageal adenocarcinoma (cases) and patients
whose Barrett’s esophagus did not progress during surveillance
(controls). Conditional logistic regression or Cox regression
can be used to compare the feature in cases versus controls, and
to return a coefficient to weigh the feature in a univariate risk
prediction analysis. The feature can also be entered into multi-
variable model building along with features derived from other
biomarkers, as described above.
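The tile-based summary in step 2 of this workflow can be sketched in a few lines of Python. The sketch makes two simplifying assumptions that should be stated plainly: it averages every pixel in a tile rather than only the pixels inside segmented cell-based objects, and it discards partial tiles at the image edges. The function names tile_means and top_fraction_feature are hypothetical, and the tile size defaults to the 161 pixel edge length quoted in the text.

```python
import math


def tile_means(image, tile=161):
    """Mean intensity of each non-overlapping tile x tile block (edge remainders are dropped)."""
    rows, cols = len(image), len(image[0])
    means = []
    for top in range(0, rows - tile + 1, tile):
        for left in range(0, cols - tile + 1, tile):
            total = sum(sum(image[y][left:left + tile]) for y in range(top, top + tile))
            means.append(total / (tile * tile))
    return means


def top_fraction_feature(values, fraction=0.05):
    """Average of the highest-scoring fraction of tiles (always at least one tile)."""
    if not values:
        raise ValueError("no tiles to summarize")
    k = max(1, math.ceil(fraction * len(values)))
    top = sorted(values, reverse=True)[:k]
    return sum(top) / k


# Toy 4 x 4 image with one bright 2 x 2 focus, tiled at 2 x 2 for illustration.
image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 8, 8],
    [0, 0, 8, 8],
]
scores = tile_means(image, tile=2)       # [1.0, 0.0, 0.0, 8.0]
feature = top_fraction_feature(scores)   # 8.0: the single brightest tile
```

Summarizing only the top-scoring tiles is what lets a focal abnormality dominate the feature value; a whole-slide mean of the same toy image would be 2.25 and would dilute the bright focus.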
4 Notes
1. 5 μm is the optimal section thickness for tissue image object
segmentation since it is thick enough to include an optimal
number of whole nuclei, yet thin enough to avoid too many 3D
overlaps. Prior to labeling, store slides at 2–10 °C under vacuum
to protect epitopes.
2. The serum type in the blocking buffer should match the species
in which the secondary antibodies were generated. We use
secondary antibodies raised in goat and thus use blocking
buffer containing goat serum. The concentrations of blocking
buffer components such as BSA, glycerol, cold water fish skin
gelatin, and sodium azide can be titrated to minimize nonspecific
labeling, depending on the antibodies and tissue type used.
3. A range of primary antibody dilutions should be tested on
tissues or cell line controls with known negative, intermediate,
and high expression of the target biomarker. Fluorescence
signal should be quantified in tissue areas/cells with positive
expression (signal) and tissue areas/cells with negative/back-
ground or nonspecific labeling (noise). The signal:noise should
be calculated for each dilution to determine the appropriate
dilution for the specific application, which in our experience is
the dilution that results in signal:noise 5. Use of in vitro
diagnostic (IVD)-labeled antibodies will ensure lot-to-lot
reproducibility. Even with IVD-labeled antibodies, new lots
should be validated to ensure that the antibody specificity and
signal:noise are equivalent between lots. We recommend using
sections of FFPE cell lines on slides with at least 1 negative,
1 intermediate, and 1 high expressing cell line for each bio-
marker assessed. FFPE cell lines can also be utilized as batch
controls in each run of patient slides being labeled, imaged, and
analyzed. A method for the preparation of FFPE cell line con-
trols has been previously described [9].
4. Primary antibodies within a single cocktail must be raised in
different species or different species isotypes. For example:
(a) Mouse IgG2a anti-p16 antibody, rabbit IgG anti-AMACR
antibody, and mouse IgG2b anti-p53 antibody can be multi-
plexed within a cocktail to assess p16, AMACR, and p53
expression on a slide. (b) Rabbit IgG anti-HIF-1alpha
antibody, mouse IgG2a anti-CD45RO antibody, and mouse
IgG1 anti-CD1a antibody can be multiplexed within a cocktail
to assess CD1a, CD45RO, and HIF-1alpha on a slide.
5. Refer to the manufacturer’s product information to ensure that
the secondary antibodies have been highly cross-adsorbed to
minimize cross-reactivity. Protein aggregates may form in the
secondary antibody solutions. Therefore, the cocktail of fluo-
rescently conjugated secondary antibodies should be centri-
fuged at high speed for 3 s prior to use. Only the supernatant
should be applied to slides in order to minimize nonspecific
labeling.
6. Alexa Fluor 488-conjugated goat anti-mouse IgG2a antibody,
Alexa Fluor 555-conjugated goat anti-rabbit IgG, and Alexa
Fluor 647-conjugated goat anti-mouse IgG2b can be prepared
in a cocktail to detect the p16, AMACR, p53 antibody panel
described in Note 4a. Alexa Fluor 488-conjugated goat anti-
rabbit IgG, Alexa Fluor 555-conjugated goat anti-mouse
IgG2a, and Alexa Fluor 647-conjugated goat anti-mouse
IgG1 can be prepared in a cocktail to detect the HIF-1alpha,
CD45RO, and CD1a antibody panel described in Note 4b. The fluorescently
conjugated antibodies should be used at a dilution
pre-determined by titration to produce optimal signal:noise.
Dilutions ranging from 1:200–1:400 are used for the second-
ary antibody cocktails described here.
7. The light source should be equipped with a calibration device
to ensure the consistent illumination necessary for quantitative
image analysis of biomarkers and morphology.
8. Ambient laboratory temperature and humidity should be mon-
itored and maintained within an established range. Variations
in these environmental conditions will affect tissue processing
and labeling steps that are performed under ambient condi-
tions, which may increase intra- and inter-run imprecision.
9. The epitope retrieval temperature should be optimized for the
specific panel of primary antibodies used. We use 100 °C for
the p16, AMACR, p53 panel, and 98 °C for
the HIF-1alpha, CD45RO, CD1a panel.
10. Longer washing incubation times and/or increased numbers
of washes can be used to minimize nonspecific labeling where
necessary. We use three 0 s washes post-primary antibody
incubation for the p16, AMACR, p53 biomarker panel and
ten 1 min washes for the HIF-1alpha, CD45RO, CD1a panel
described here, and three 0 s washes post-secondary antibody
incubation.
11. Proper mounting is essential to generation of high-quality
tissue images suitable for image analysis. Mounting medium
such as Prolong Gold Antifade Mountant should be at room
temperature prior to use. Care should be taken to avoid bub-
bles, fibers, and particulate matter in the mounting medium as
these will result in image artifacts that will interfere with quan-
titative image analysis. Artifacts in tissue images should be
removed via image annotation or algorithms prior to image
analysis. Following mounting, store slides horizontally at room
temperature protected from light for at least 12 h to ensure
proper curing of the mounting medium. Seal edges of cover-
slips with clear nail polish. Thoroughly clean the back of the
slide and the outside of the coverslip with lens cleaner prior to
slide scanning. Store fluorescently immunolabeled slides at
2–10 °C protected from light.
12. The quadband filter set and single-band bandpass excitation
filters (described under Subheading 2.2) are calibrated to sepa-
rate the DAPI, FITC, TRITC, Cy5, or equivalent fluorophores
as recommended in this protocol (Hoechst 33342, Alexa
Fluors 488, 555, and 647).
13. Correct image registration is necessary for quantitative image
analysis involving measurement of biomarkers across different
fluorescent channels, which is required to generate a systems
profile of a tissue, which may include co-expression of biomar-
kers and spatial relationships between biomarkers that are
imaged and quantified in different fluorescent channels.
14. Scanner precision should be monitored to ensure consistent
excitation within imaging runs and between imaging runs. This
is particularly important for clinical studies that can involve
imaging of hundreds of patients over many months.
15. Whole slide digital images generated on the ScanScope FL
scanner can be annotated to remove artifacts or select regions
of interest using ImageScope software prior to reading into
image analysis software.
16. Image analysis features should be normalized (centered or
standardized) to correct for intra- and inter-run variability.
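Note 16 asks for features to be centered or standardized but does not prescribe a scheme. One possible reading, shown below as a Python sketch, is to z-score each feature within its labeling/imaging run. Grouping by run is an assumption made here for illustration; it is only sensible when each run contains a comparable mix of samples, and the function name standardize_by_run is hypothetical.

```python
from statistics import mean, pstdev


def standardize_by_run(values, runs):
    """Z-score each feature value against the other slides from the same run."""
    by_run = {}
    for value, run in zip(values, runs):
        by_run.setdefault(run, []).append(value)
    stats = {run: (mean(v), pstdev(v)) for run, v in by_run.items()}
    scaled = []
    for value, run in zip(values, runs):
        center, spread = stats[run]
        # A run with zero spread cannot be scaled, so it is centered only.
        scaled.append((value - center) / spread if spread else value - center)
    return scaled


# Two runs whose raw intensities differ tenfold but whose ranking is identical.
values = [10, 20, 30, 100, 200, 300]
runs = ["A", "A", "A", "B", "B", "B"]
scaled = standardize_by_run(values, runs)
```

After scaling, the lowest slide in run A and the lowest slide in run B receive the same score, which is the behavior wanted when the tenfold difference is a run effect rather than biology. An alternative, also consistent with the text, would be to scale against the FFPE cell line batch controls mentioned in Note 3.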
Acknowledgments
This work was supported by the National Cancer Institute of the National Institutes of Health
under Award Number R44CA192416. We thank Lia Reese,
Bruce Campbell, and Kathleen Repa for technical assistance in the
development and validation of the TissueCypher® methodology
described in this chapter.
References
1. Pantanowitz L, Valenstein PN, Evans AJ, Kaplan KJ, Pfeifer JD, Wilbur DC, Collins LC, Colgan TJ (2011) Review of the current state of whole slide imaging in pathology. J Pathol Inf 2:36
2. Dennis J, Parsa R, Chau D, Koduru P, Peng Y, Fang Y, Sarode VR (2015) Quantification of human epidermal growth factor receptor 2 immunohistochemistry using the Ventana image analysis system: correlation with gene amplification by fluorescence in situ hybridization: the importance of instrument validation for achieving high (>95%) concordance rate. Am J Surg Pathol 39(5):624–631
3. Gough A, Lezon T, Faeder J, Chennubhotla C, Murphy R, Critchley-Thorne R, Taylor DL (2014) High content analysis and cellular and tissue systems biology: a bridge between cancer cell biology and tissue-based diagnostics. In: Mendelsohn J, Howley PM, Israel MA, Gray JW, Thompson CB (eds) The molecular basis of cancer, 4th edn. Elsevier, New York
4. Prichard JW, Davison JM, Campbell BB, Repa KA, Reese LM, Nguyen XM, Li J, Foxwell T, Taylor DL, Critchley-Thorne RJ (2015) TissueCypher: a systems biology approach to anatomic pathology. J Pathol Inf 6:48
5. Critchley-Thorne RJ, Duits LC, Prichard JW, Davison JM, Jobe BA, Campbell BB, Repa KA, Reese LM, Li J, Diehl DL, Jhala NC, Ginsberg GG, DeMarshall M, Foxwell T, Zaidi AH, Taylor DL, Rustgi AK, Bergman JJ, Falk GW (2016) A novel tissue systems pathology test predicts progression in Barrett’s esophagus patients. Cancer Epidemiol Biomark Prev 25(6):958–968
6. MathWorks image processing toolbox. https://www.mathworks.com/products/image
7. Kothari S, Phan JH, Wang MD (2013) Eliminating tissue-fold artifacts in histopathological whole-slide images for improved image-based prediction of cancer grade. J Pathol Inf 4:22
8. Hang W, Phan JH, Bhatia AK, Cundiff CA, Shehata BM, Wang MD (2015) Detection of blur artifacts in histopathological whole-slide images of endomyocardial biopsies. Conf Proc IEEE Eng Med Biol Soc 2015:727–730
9. Dolled-Filhart M, McCabe A, Giltnane J, Cregger M, Camp RL, Rimm DL (2006) Quantitative in situ analysis of beta-catenin expression in breast cancer shows decreased expression is associated with poor outcome. Cancer Res 66(10):5487–5494
Part V
Modeling Drug Responses in Cancer Cells

Chapter 14
Bioinformatics Approaches to Predict Drug Responses from Genomic Sequencing
Neel S. Madhukar and Olivier Elemento
Abstract
Fulfilling the promises of precision medicine will depend on our ability to create patient-specific treatment
regimens. Therefore, being able to translate genomic sequencing into predicting how a patient will respond
to a given drug is critical. In this chapter, we review common bioinformatics approaches that aim to use
sequencing data to predict sample-specific drug susceptibility. First, we explain the importance of custo-
mized drug regimens to the future of medical care. Second, we discuss the different public databases and
community efforts that can be leveraged to develop new methods for identifying new predictive biomar-
kers. Third, we cover the basic methods that are currently used to identify markers or signatures of drug
response, without any prior knowledge of the drug’s mechanism of action. We further discuss how one can
integrate knowledge about drug targets, mechanisms, and predictive markers to better estimate drug
response in a diverse set of samples. We begin this section with a primer on popular methods to identify
targets and mechanisms of action for new small molecules. This discussion also includes a set of
computational methods that incorporate other drug features, such as drug structures, side effects, and
efficacy profiles, which do not relate to drug-induced genetic changes or sequencing data. Those additional
drug properties can improve the accuracy of drug target and mechanism-of-action identification.
We then progress to discuss using these targets in combination with disease-specific expression patterns,
known pathways, and genetic interaction networks to aid drug choice. Finally, we conclude this chapter
with a general overview of machine learning methods that can integrate multiple pieces of sequencing data
along with prior drug or biological knowledge to drastically improve response prediction.
Key words Bioinformatics, Precision medicine, Drug response, Machine learning, Biomarkers
1 Introduction
One of the greatest challenges in the current paradigm of medicine
is how to deal with patient heterogeneity—both across different
diseases and even within patients diagnosed with the same disease.
Over the past 50 years there have been many studies showing that
patients with the same disease have completely different responses
when treated with the same drug [1–3]. The prevailing hypothesis
to explain the heterogeneous response is each patient’s specific
genetic profile. Precision medicine involves using this patient-
specific genomic information to guide drug treatment, with the
expectation that this will ultimately improve clinical outcomes
[4]. With the decrease in sequencing costs over the past decade, it
is now possible to obtain genomic information for patients prior to
determining a specific treatment regimen. In addition, there has
been an emergence of bioinformatics methods to interpret this
sequencing data and come up with actionable strategies for precise
drug choices. These methods not only allow for the identification
of specific genetic traits that confer susceptibility or resistance to
drug treatment, but can also combine genetic markers with gene
ontologies and biological networks to predict precise response
levels. In this chapter, we provide an overview of these bioinfor-
matics methods, review the basic premises for each type of method,
and discuss some of the current problems and future challenges that
need to be solved. While we tend to focus on cancer, the databases
and methods we describe are often applicable to other diseases
as well.
2 Databases
In recent years, there have been a number of community efforts to
generate and publicly release datasets that could be used to improve
drug response prediction. Table 1 lists some of these datasets. In
this review we will cover what we believe to currently be the best-
suited and most popular public resources for aiding drug response
prediction.
2.1 NCI60 Drug Sensitivity Database
The National Cancer Institute’s (NCI) 60 cell line drug screen is a
database of in vitro drug efficacies (either in terms of GI50, LD50,
or TGI) for over 50,000 compounds screened against the NCI60
panel of cancer cell lines [5]. With 60 cancer cell lines from nine
distinct tumor types—leukemia, colon, lung, central nervous sys-
tem, renal, melanoma, ovarian, breast, and prostate—the NCI60
collection aims to provide information on a broad set of genetic
conditions and tumor types. The NCI60 panel has itself been
profiled using a variety of assays from genomic to gene expression
and proteomics [6–9]. The profiling data can be used in conjunc-
tion with the Developmental Therapeutics Program’s (DTP) drug
screening database to identify genetic signatures indicative of a
certain response pattern.
2.2 Cancer Cell Line Encyclopedia
The Cancer Cell Line Encyclopedia (CCLE) [10, 11] is a database
of 947 different human cancer cell lines encompassing 36 different
tumor types that have been genetically profiled (gene expression,
copy number, mutations, etc.). Furthermore, 24 known anticancer
drugs were profiled against approximately 500 of these cell lines.
Though the number of compounds profiled is smaller than the
NCI60 drug screen, the greater number of cell lines tested allows
for more precise identification of genetic predictors of sensitivity for
the drugs measured.
Table 1
List of databases and abbreviations that are mentioned throughout the text of the chapter
| Abbreviation | Full description | Website |
|---|---|---|
| GI50 | Concentration of a compound that leads to a 50% inhibition of cell proliferation | |
| IC50 | Concentration of a compound that leads to a 50% decrease in the desired activity | |
| LD50 | Concentration of a compound that leads to 50% cell death | |
| TGI | Total growth inhibition | |
| GWAS | Genome-Wide Association Study | |
| SNP | Single-nucleotide polymorphism | |
| DREAM | Dialogue on reverse engineering assessment and methods | |
| NCI60-DTP | Drug screen of 60 cancer cell lines by the National Cancer Institute’s (NCI) Developmental Therapeutics Program (DTP) | https://dtp.cancer.gov |
| CCLE | Cancer cell line encyclopedia | http://www.broadinstitute.org/ccle/home |
| CMap | Connectivity map | http://www.broadinstitute.org/cmap |
| GDSC | Genomics of drug sensitivity in cancer | http://www.cancerrxgene.org |
| TCGA | The Cancer Genome Atlas | http://cancergenome.nih.gov |
| GTEx | Genotype-tissue expression | http://www.gtexportal.org/home/ |
2.3 Genomics of Drug Sensitivity in Cancer
Hosted by the Wellcome Trust Sanger Institute, the Genomics of
Drug Sensitivity in Cancer (GDSC) database is a massive drug
screen project similar to the NCI60 and CCLE. In their initial
release, investigators screened a set of 138 known anti-cancer com-
pounds against over 1000 different cancer cell lines (on average
525 cell lines tested per compound). Each cell line also was sub-
jected to thorough expression and copy number profiling along
with targeted mutation data for a set of 75 cancer genes. This
dataset constitutes another great resource for the identification of
genomic markers of drug responses.
2.4 Connectivity Map/LINCS
Released by the Broad Institute, the Connectivity Map (CMap)
seeks to find connections between small molecules, physiological
processes, and disease states [12]. Using mRNA expression
(measured by DNA microarrays) as the “language” of cellular
response, the CMap measures how a panel of cancer cell lines
responds transcriptionally to a variety of different drug treatments.
This approach had previously been successful in identifying drug
mechanisms in yeast but had never been applied to cancer cells
[13]. The investigators profiled four different cancer cell lines
before and after treatment with a panel of more than 1000 small
molecules. The LINCS database is an updated version of this
profiling system with a much larger number of drugs and cell
lines. This database makes use of the L1000 expression
profiling system, in which the expression of approximately 1000 key
genes is measured and used to infer the global gene expression profile.
From these transcriptional changes it is possible to explore a
drug’s mechanisms of action. The resulting profiles can then be used to
repurpose drugs for specific diseases or genetic states [14, 15].
3 Identification of Genomic Markers of Drug Response
A key first step to any drug response prediction effort involves the
identification of genomic markers that can impact efficacy. Identify-
ing those markers makes response prediction a much simpler task.
Once a polymorphism, gene expression pattern, or pathway has
been identified, all new samples can simply be screened for that
marker and, using known correlations with drug response, a pre-
diction of drug susceptibility can be made. Here, we focus on a
variety of approaches that can be used to identify genomic markers
indicative of drug response.
3.1 Using Genome-Wide Association Studies to Identify Polymorphisms Related to Drug Response
Genome-Wide Association Studies (GWAS) have classically been used
to detect genetic variations associated with specific disease phenotypes.
However, in recent years, GWAS has proved to be a
powerful method to identify polymorphisms that can affect drug
efficacy and toxicity [16]. Unlike approaches focusing on known
drug targets or candidate gene lists, GWAS provides a hypothesis-free
method that can systematically test a large number of variants
[17, 18]. In order to run a GWAS one must provide a measure of
response or toxicity for a large number of samples, as well as a
thorough genotyping of each sample.
Table 2
Sample contingency table showing how we can use the number of responders with a certain SNP to
test whether it is related to drug efficacy
| | Responders | Non-responders |
|---|---|---|
| SNP present | 90 | 15 |
| SNP absent | 10 | 485 |
GWA studies typically fall into two main categories depending
on whether the provided response measure is categorical (such as
case/control, responder/non-responder, adverse reaction/no
reaction, etc.) or quantitative (such as IC50 or a measure of side
effect severity). Recently, there have been a series of developments
improving the traditional GWAS, such as taking into account a
gene’s functional information [19], epistasis [20], or missing data
[21]. Here, we review the basic premise of the categorical and
quantitative GWA studies:
1. Categorical: The goal of a categorical GWAS is to identify SNPs
that are highly predictive of which category a given sample will
be assigned to. To begin with, samples are assigned to one of
two categories based on either their response to a given drug or
the observation of a given adverse effect. For each observed
SNP, we count the number of samples in each category where that
SNP is present (or absent). These counts are then used to populate what
is known as a contingency table. For instance, if in a dataset with
100 responders and 500 non-responders we observe 90 responders with a
certain SNP and 15 non-responders with that same SNP, the
resulting contingency table is the one shown in Table 2. A statistical test
is then run on each contingency table to measure the deviation
from the null hypothesis, which assumes that there is no association
between the SNP and the categorical classes. The most commonly
used tests are the chi-squared test and the related
Fisher’s exact test. This approach has successfully identified
variants related to interferon beta [22] and anti-TNF treatment
efficacy [23] as well as variants predictive of statin-induced
myopathy [24].
2. Quantitative: Instead of using a contingency table test to detect
significantly associated SNPs, a quantitative GWAS traditionally
uses a generalized linear model (GLM), such as an Analysis of
Variance (ANOVA)—a variant of a linear regression analysis—to
identify SNPs that are highly correlated to the variable of interest
(such as drug IC50) [25]. Though more complicated than the
categorical case, there exist a number of public bioinformatics
software packages such as PLINK [26] or SNPTEST [27] that
can run quantitative GWAS and output a p-value for each poly-
morphism. While these analyses are less common for drug
response prediction because of the difficulty in measuring quan-
titative response values, various groups have successfully used
them to identify SNPs associated with susceptibility to chemo-
therapeutic drugs [28] or ACE inhibitors [29].
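The categorical test described in item 1 can be sketched on the counts of Table 2. This is a minimal illustration using only the Python standard library, not the implementation used by PLINK or SNPTEST; the function names are hypothetical, the chi-squared test is computed without continuity correction, and the Fisher's exact test shown is the one-sided variant (enrichment of the SNP among responders).

```python
import math
from fractions import Fraction

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom, no continuity
    correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test: probability of observing at least
    `a` SNP-carrying responders if SNP and response were independent."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = math.comb(n, col1)
    tail = sum(
        Fraction(math.comb(row1, k) * math.comb(n - row1, col1 - k), denom)
        for k in range(a, min(row1, col1) + 1)
    )
    return float(tail)

# Table 2: rows = SNP present/absent, columns = responders/non-responders
stat, p_chi2 = chi_squared_2x2(90, 15, 10, 485)
p_fisher = fisher_exact_greater(90, 15, 10, 485)
odds_ratio = (90 * 485) / (15 * 10)
print(f"chi2 = {stat:.1f}, p(chi2) = {p_chi2:.2e}, "
      f"p(Fisher) = {p_fisher:.2e}, OR = {odds_ratio:.0f}")
```

In a real GWAS this test would be repeated for every genotyped SNP, which is why the resulting p-values need to be corrected for multiple testing.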
Fig. 1 Sample Manhattan plot (x-axis: chromosome, 1 to 22; y-axis: -log10(p)) showcasing how one can use
the output of a GWAS calculation to find SNPs related to drug efficacy. Boxed hits (labeled "Significantly
Associated Hits") represent those that pass the significant p value cutoff and thus may be relevant to
treatment response
Regardless of the type of GWAS used, the output is a set of
p-values, one for each polymorphism tested. One important caveat is
that all p-values must be corrected for multiple hypothesis testing
(MHT) to account for the large number of statistical tests being
performed. The most commonly used MHT correction methods are the
Bonferroni and Benjamini-Hochberg corrections. Adjusted p-values
are then visualized using a Manhattan plot where the genomic
position of each SNP is plotted against the negative log10 of its
p-value (see Fig. 1). Using the Manhattan plot one can visually
identify genomic regions or particular SNPs that are significantly
associated with the given response feature.
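The two corrections named above can be sketched in a few lines. The p-values are invented for illustration and the function names are hypothetical; in practice a GWAS package or statistics library would normally perform this step.

```python
import math

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.60]   # hypothetical per-SNP p-values
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
# y-axis values of a Manhattan plot: -log10 of each p-value
manhattan_y = [-math.log10(p) for p in raw]
```

Bonferroni controls the chance of any false positive and is therefore stricter; Benjamini-Hochberg controls the expected fraction of false positives among the reported hits.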
3.2 Using Gene Expression to Find Response Signatures and Predict Response
While GWA studies aim to find a set of mutations or polymorphisms
that are predictive of how a patient will respond to a drug,
another popular approach is using gene expression data to find an
expression signature associated with a positive (or negative)
response. Different transcriptional profiles can often lead to differ-
ent levels of drug efficacy, and differential expression analyses can
help pinpoint the specific genes or pathways that drive the hetero-
geneous drug response and can be used to predict response levels.
The classic approach involves treating a cohort of mice,
patients, patient samples, or cell lines with a given drug and
measuring the degree of response in each sample. Similar to a
GWAS, the response can be measured either categorically
(responder/non-responder) or as a continuous variable. Using
either sequencing data from before treatment or differential gene
expression (comparing pre- and post-treatment samples) one can
search for gene expression patterns that seem more prevalent in the
samples that are susceptible (or resistant) to treatment (see Fig. 2).
For instance, one would expect genes that confer drug
resistance to be more highly expressed in samples where drug
treatment shows a limited effect.
Fig. 2 Diagram on how gene expression patterns from responders and non-responders can be used to identify
signatures related to response and how these can be used to better select new patients likely to respond.
Panels: drug screening data (susceptible and resistant samples); differential gene analysis yielding the
identified genomic signature of response (Gene 1, Gene 2); new sample response prediction (predicted
susceptibility or predicted resistance)
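The idea of deriving a signature from responders and non-responders and applying it to a new sample can be illustrated with a toy sketch. The gene names, expression values, threshold, and scoring rule are all assumptions made for illustration; a real analysis would use a dedicated differential expression method with proper statistics rather than a raw difference in means.

```python
from statistics import mean

# Hypothetical log2 expression values, three samples per group
responders = {
    "GENE1": [8.1, 7.9, 8.4],
    "GENE2": [3.0, 3.2, 2.9],
    "GENE3": [5.0, 5.2, 4.9],
}
non_responders = {
    "GENE1": [4.0, 4.3, 3.8],
    "GENE2": [6.9, 7.1, 7.3],
    "GENE3": [5.1, 4.8, 5.0],
}

def build_signature(resp, non_resp, min_abs_diff=1.0):
    """Keep genes whose mean log2 expression differs between groups by at
    least `min_abs_diff`; store the direction and the group midpoint."""
    signature = {}
    for gene in resp:
        resp_mean, non_mean = mean(resp[gene]), mean(non_resp[gene])
        diff = resp_mean - non_mean
        if abs(diff) >= min_abs_diff:
            direction = 1 if diff > 0 else -1
            signature[gene] = (direction, (resp_mean + non_mean) / 2)
    return signature

def predict(sample, signature):
    """Sum signed distances from each midpoint; a positive score means
    the sample resembles the responders."""
    score = sum(direction * (sample[gene] - midpoint)
                for gene, (direction, midpoint) in signature.items())
    return "susceptible" if score > 0 else "resistant"

signature = build_signature(responders, non_responders)
prediction_a = predict({"GENE1": 7.5, "GENE2": 3.5, "GENE3": 5.0}, signature)
prediction_b = predict({"GENE1": 4.2, "GENE2": 7.0, "GENE3": 5.0}, signature)
```

Here GENE3 is dropped from the signature because it barely differs between the two groups, so it plays no role in the prediction.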
A number of methods exist for detecting differential expression
across a set of samples. For microarray data, statistical
tests such as an ANOVA often suffice, but packages such as limma
[30] (see also Chapter 6 for an application of the limma package on
phosphoproteomics data) use linear models that can help deal with
more complicated experimental designs. For RNA-seq data the
most popular methods include limma-voom [31], DESeq2
[32], edgeR [33], and Cufflinks (cuffdiff) [34]. DESeq2 and
edgeR are currently considered the standard for differential expression
analysis and both use similar underlying models, albeit with
different dispersion estimates. In our experience,
DESeq2 tends to be more conservative. One key difference
between DESeq2/edgeR and limma-voom is that voom does not
employ a negative binomial distribution and instead estimates the
mean-variance relationship. Therefore, voom may be a better choice
if the input data differs strongly from a negative binomial distribution.
Finally, one major difference between the cuffdiff pipeline and
DESeq2 is that cuffdiff acts on the level of transcripts while
DESeq2 uses gene counts as inputs. Additionally, Wright et al.
[35] used a Bayesian predictor to automatically separate samples
into subtypes based on their respective gene expression profiles,
and used the output p-values to find the set of genes most predictive
of subtype. This type of approach is useful for pooled sets of
samples without knowledge of their subtype—for instance when
one would like to determine if well-responding patients all fall into
a certain disease subtype [36]. While initially tested on microarray
data, this approach can be easily adapted to RNA-seq data and
could generally be adapted to all types of predictive models.
3.3 Using Pathway Annotations and GSEA to Identify Differential Biological States
Often a differential gene expression analysis will have a set of genes
as output, which has no obvious pattern or relevance to the type of
drug being investigated. Additionally, it is quite common for a set
of genes to be marked as significant in a differential gene expression
analysis, but when experiments are done to perturb individual
genes they seem to have little to no effect on drug response. In
cases like these it is often helpful to translate the differentially
expressed genes into a set of enriched biological pathways or gene
sets. These can provide a broader explanation of a drug’s mecha-
nism of action and a clearer understanding on how to predict
efficacy. This approach has previously been successful not only in
drug response prediction, but also in the development of highly
effective drugs. Overexpression of the mTOR pathway in lym-
phoma led to the development of inhibitors to specifically target
genes in that pathway [37], and global activation of the epidermal
growth factor receptor pathway was found to be predictive of
erlotinib susceptibility in pancreatic cancer xenografts [38].
The basic technique for finding enriched pathways or canonical
gene sets is to first annotate each gene based on the pathways/sets
it falls into. A few popular resources for pathway and gene set
annotation include: the Molecular Signatures Database (MSigDB)
[39], Reactome [40, 41], the Kyoto Encyclopedia of Genes and
Genomes (KEGG) [42], Gene Ontologies [43], and InnateDB
[44, 45]. Reactome, KEGG, and InnateDB group genes based on
their biochemical pathways (with InnateDB focusing on pathways
relating to immunity), Gene Ontologies group genes based on their
biological/molecular function or cellular localization, and
MSigDB is a combination of all the aforementioned databases
with custom sets of “hallmark” gene sets, or important genes
involved in certain processes. Following annotation, a statistical
test (such as Fisher’s exact test) can be used to test whether a
certain pathway is enriched for up- (or down-) regulated genes
compared to what would be expected by random chance.
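A minimal sketch of such an over-representation test follows, using the hypergeometric tail probability that underlies the one-sided Fisher's exact test. The gene identifiers, pathway names, and function name are hypothetical, and the sketch assumes that every pathway gene and every differentially expressed gene belongs to the background set.

```python
import math
from fractions import Fraction

def overrepresentation_p(background_size, pathway_genes, hit_genes):
    """Probability of drawing at least the observed number of pathway
    genes when picking len(hit_genes) genes at random from the background."""
    overlap = len(pathway_genes & hit_genes)
    n_pathway, n_hits = len(pathway_genes), len(hit_genes)
    denom = math.comb(background_size, n_hits)
    tail = sum(
        Fraction(math.comb(n_pathway, k)
                 * math.comb(background_size - n_pathway, n_hits - k), denom)
        for k in range(overlap, min(n_pathway, n_hits) + 1)
    )
    return overlap, float(tail)

background = {f"g{i}" for i in range(1, 21)}          # 20 measured genes
pathways = {
    "pathway_1": {"g1", "g2", "g3", "g4", "g5"},
    "pathway_2": {"g6", "g7", "g8", "g9", "g10"},
}
upregulated = {"g1", "g2", "g3", "g12"}               # hypothetical hits

results = {name: overrepresentation_p(len(background), genes, upregulated)
           for name, genes in pathways.items()}
```

Because one test is run per pathway, the resulting p-values would again need multiple testing correction before interpretation.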
Another popular method for testing pathway enrichment is
Gene Set Enrichment Analysis (GSEA) [46]. GSEA tests whether
genes of a certain pathway/set are differentially expressed between
the cases. It does this by computing an enrichment score for each
gene set: the score increases if genes in the set are differentially
expressed and decreases if they are not. Using a number of permutations
(the number can be set by the user), it then tests whether that
enrichment score is significantly different from what would be
expected by chance. Packaged with the MSigDB gene sets, GSEA
has demonstrated success at identifying common biological pathways
in independent lung cancer datasets while single-gene differential
analyses could not [46].
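The running-score idea can be sketched as follows. This is a simplified, unweighted variant written for illustration, not the published GSEA algorithm: the real method weights each step by the gene's correlation with the phenotype and by default permutes sample labels, whereas this sketch takes equal steps and shuffles the gene ranking. All gene names and function names are hypothetical.

```python
import random

def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up at set members and down
    otherwise; return the largest deviation of the running sum from zero."""
    n_in = sum(1 for g in ranked_genes if g in gene_set)
    n_out = len(ranked_genes) - n_in
    if n_in == 0 or n_out == 0:
        return 0.0
    running, best = 0.0, 0.0
    for g in ranked_genes:
        running += 1.0 / n_in if g in gene_set else -1.0 / n_out
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical p-value: how often a shuffled ranking gives a score at
    least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, gene_set)) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Genes ranked from most up-regulated to most down-regulated (hypothetical)
ranked = [f"g{i}" for i in range(1, 21)]
pathway = {"g1", "g2", "g3", "g5"}
es, p_value = permutation_p_value(ranked, pathway)
```

A gene set concentrated near the top of the ranking, as in this example, produces a large positive score even if no single gene would pass a significance cutoff on its own.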
4 Identifying Drug Targets and Mechanisms and Using Them to Improve Response
4.1 Computational Techniques to Identify Drug Targets and Mechanisms
For a small molecule in development the mechanisms of action and
binding targets are often not fully understood. A number of
computational methods exist that seek to predict targets for these
orphan small molecules, based either on chemical structure or on their
downstream effects. These methods can broadly be divided into
three categories:
1. Molecular dynamics: Using intricate mathematical models,
molecular dynamics methods computationally simulate a
drug’s interaction with a given protein. To predict targets, an
orphan small molecule is tested against a series of proteins to
identify any with favorable binding results [47, 48]. However,
this approach requires significant computation power, complex
mathematical models, and full 3D structures for each queried
protein—data that is often unavailable.
2. Ligand-based [49, 50]: Using a set of known protein binding
partners for a given small molecule, ligand-based approaches
apply machine learning techniques to find other proteins with
high enough similarity to the known targets. The proteins with
high degrees of similarity are predicted to be novel binding
targets. However, ligand-based methods often require a large
number of known binding partners for each tested small molecule,
and thus can mostly be used on drugs that are already far along in
development.
3. Downstream effect based: Recently, a number of methods have
emerged that use the downstream effects of a small molecule
(such as induced gene expression change [51] or side effects
[52]) to predict targets for orphan small molecules. The basic
premise of these methods is to compare the effects of an orphan
small molecule to the effects of drugs with known targets. If the
orphan molecule has an effect very similar to a drug with a
known target, one would predict this known target to also be a
target of the orphan small molecule. However, most current
methods only utilize a small number of the available data sources
and are thus not broadly applicable to all drug types. Our lab
recently developed BANDIT, a novel computational method
that integrates multiple different pieces of data on small mole-
cules to predict specific binding targets and mechanisms
[53]. When tested on a set of diverse drugs, BANDIT achieved
an accuracy of approximately 90% at identifying known targets
(validated using a standard cross validation setup), much higher
than expected from other target prediction methods.
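The basic premise of the downstream-effect approach in item 3 can be sketched with a single data type. This is a toy illustration and not the BANDIT method, which integrates multiple data types; the drug names, targets, expression-change profiles, and similarity threshold are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long numeric profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical drug-induced log fold changes for five genes
known_drugs = {
    "drug_A": {"target": "TARGET_1", "profile": [2.0, -1.5, 0.1, 0.0, 1.2]},
    "drug_B": {"target": "TARGET_2", "profile": [-1.8, 1.4, 0.2, 0.1, -1.0]},
    "drug_C": {"target": "TARGET_3", "profile": [0.1, 0.0, 2.2, -1.9, 0.2]},
}

def predict_target(orphan_profile, references, min_similarity=0.8):
    """Return the target of the most similar reference drug, or None if
    no reference reaches the similarity threshold."""
    best_name, best_r = max(
        ((name, pearson(orphan_profile, ref["profile"]))
         for name, ref in references.items()),
        key=lambda pair: pair[1],
    )
    if best_r < min_similarity:
        return None, best_r
    return references[best_name]["target"], best_r

target, similarity = predict_target([1.8, -1.2, 0.0, 0.2, 1.0], known_drugs)
no_target, low_similarity = predict_target([0.0, 1.0, 0.0, 1.0, 0.5], known_drugs)
```

Returning no prediction when every similarity is low avoids assigning a target to a molecule that resembles none of the reference drugs.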
Another popular option is to focus on a drug’s broad mecha-
nism of action rather than its specific binding targets. One way to
accomplish this is to observe how a given drug changes the tran-
scriptional profile in a sample. For example, using gene expression
data following cisplatin treatment, this type of analysis identified
the p53 response and other pathways to be involved in cisplatin
response [54]. This approach has become more practical with the
emergence of public databases such as the Connectivity Map
(CMap) [55]. From the CMap database, one can calculate fold
change values for each gene after drug treatment. Using GSEA or
other pathway enrichment methods, the fold change values can be
converted into a set of pathway scores that reveal which pathways
were enriched or mobilized. Though far less precise than specific
target identification, this information is easier to obtain and could
provide additional information on the context in which a given
drug could be used.
4.2 Using Known Drug Targets to Predict Response
Assuming one can determine the mechanisms of action of a drug
(either in terms of specific binding targets or broad knowledge of
the biological pathways mobilized), the task of predicting efficacies
is often much simpler. For example, if a drug’s main mechanism of
action is to target Protein A, then one would expect different
efficacies in samples based on whether there is an amplification or
deletion of Protein A. This type of reasoning also applies when
there are mutations in a known drug target. Examples of this are
treatments involving Gefitinib or Herceptin. Gefitinib is an anti-
cancer small molecule known to target the EGFR kinase, and
mutations in EGFR were found to predict sensitivity of samples
to gefitinib treatment [56]. Herceptin, an antibody that targets
HER2, was found to improve the outcomes of cancer patients
with HER2 amplifications or activating mutations [57, 58]. Another
example of this concept is vemurafenib, a small molecule that
targets BRAF carrying the V600E mutation, which has been found to be
selectively effective in cancer patients with this exact mutation, while
having no beneficial effect on samples with wild-type BRAF
[59–61]. These are just a few of the many examples showing how
combining known drug targets with targeted sequencing can help
detect instances of differential response.
However, it is also important to note that while the alterations
of a drug’s target are often predictive of efficacy, this is not always
the case, even if the target itself serves as a biomarker [62]. More-
over, there are often cases where the predictive biomarker for a
given drug is not the actual target, but rather another gene or set of
genes involved in the same pathway or biological processes as
the drug’s target. In cases like these sequencing could still prove to
be a valuable tool, and we advise utilizing some of the other
methods mentioned in this chapter. Drug target information
could be used in combination with these methods to refine predic-
tions and gain greater biological insights.
Sequencing-based approaches also can be very successful in
positioning drugs for specific disease conditions—especially differ-
ent cancer types. Using resources like the Cancer Genome Atlas
(TCGA) [63] and Genotype-Tissue Expression (GTEx) project
[64], one can find genes or pathways that are significantly upregu-
lated in certain cancers or cancer types compared to either normal
tissue samples or other cancer subtypes. Identifying such cancer-
subtype-specific, upregulated signatures could highlight drugs
known to target these signatures as particularly viable candidates
for treatment. For instance, it was recently discovered that dopa-
mine receptors were selectively upregulated in neoplastic stem cells
in breast cancer. It was observed that thioridazine (a compound
known to target dopamine receptors) was particularly effective
against these cell populations [65].
4.3 Exploiting Genetic Interactions (SL/SDL)
One approach that has become increasingly popular is exploiting
networks of synthetic lethality (SL) and synthetic dosage lethality
(SDL) to predict drug efficacy. SL describes a specific type of
genetic interaction involving two or more genes, where the loss
of either gene individually is non-fatal, but the combined loss of all
SL partner genes leads to a severe decrease in fitness or cell death.
SDL describes a related genetic interaction where lethality is
observed when one gene is lost while its SDL partner is overex-
pressed [66, 67]. Both SL and SDL interactions are highly relevant
to cancer biology, as most cancers have both widespread losses and
gains of certain genes. Exploiting these could drastically improve
patient prognosis. For instance, if Gene A and Gene B are in an SL
pair and Gene A is lost in a given cancer sample, then one would
expect compounds targeting Gene B to have better responses in
this sample (see Fig. 3).
Fig. 3 (a) Diagram highlighting the concept of synthetic lethality and how known synthetic lethal relationships
can be combined with genomic information to better predict drug response. (b) Using synthetic lethality to
predict differential response
To this end there have recently been many efforts to uncover
underlying SL and SDL networks in cancer. Among the most
successful efforts was the data mining synthetic lethality identification
pipeline DAISY [68]. DAISY uses three distinct hypotheses to
detect SL pairs (with the inverse hypotheses being used for SDL
pair detection):
1. Genes in an SL pair will have significantly lower rates of
co-mutation or co-loss.
2. Knockout/knockdown of a given gene will be more fatal in
samples with under-expression or loss of its SL partner.
3. Genes in an SL pair are more likely to be co-expressed.
By scanning for gene pairs that fulfill all three hypotheses,
DAISY predicted networks of SL and SDL interactions. It achieved
an accuracy level of approximately 77% (measured by the area under
the receiver operating characteristic curve) when compared to known SL
interactions, demonstrating that DAISY could accurately infer SL and
SDL genetic interactions. To translate this into predicting drug
responses, the authors identified sample-specific exploitable inter-
actions, or SDL interactions where one gene was overexpressed and
SL interactions where one gene was lost. DAISY then identified
drugs known to target the other gene in each exploitable interac-
tion. For each drug DAISY ranked the most sensitive samples based
on the number of exploitable interactions being targeted by each
drug. They found that specific drugs were significantly more effec-
tive in cell lines predicted to be sensitive than those predicted to be
resistant. Furthermore, the authors used a similar approach to
predict the exact IC50 value for each drug across a set of cancer
cell lines and observed a strong correlation between the predicted
and observed values (R = 0.721). Taken together these results
show how known genetic interactions (particularly SL and SDL
interactions) can be combined with sequencing data to better
predict drug sensitivities and inform treatment.
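The ranking step described above, counting the exploitable interactions that a drug targets in each sample, can be sketched as follows. This is a toy illustration of the idea and not the DAISY implementation; the gene names, drug names, interaction pairs, and sample alterations are hypothetical.

```python
# Hypothetical interaction networks
sl_pairs = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")]
# SDL pairs as (overexpressed gene, partner whose inhibition is lethal)
sdl_pairs = [("GENE_E", "GENE_F")]

drug_targets = {"drug_X": {"GENE_B"}, "drug_Y": {"GENE_D", "GENE_F"}}

samples = {
    "cell_line_1": {"lost": {"GENE_A"}, "overexpressed": set()},
    "cell_line_2": {"lost": {"GENE_C"}, "overexpressed": {"GENE_E"}},
    "cell_line_3": {"lost": set(), "overexpressed": set()},
}

def exploitable_targets(sample):
    """Genes whose inhibition should be harmful in this sample: partners
    of lost SL genes and partners of overexpressed SDL genes."""
    targets = []
    for gene_1, gene_2 in sl_pairs:
        if gene_1 in sample["lost"]:
            targets.append(gene_2)
        if gene_2 in sample["lost"]:
            targets.append(gene_1)
    for overexpressed_gene, partner in sdl_pairs:
        if overexpressed_gene in sample["overexpressed"]:
            targets.append(partner)
    return targets

def rank_samples(drug):
    """Order samples by the number of exploitable interactions the drug hits."""
    counts = {
        name: sum(1 for t in exploitable_targets(s) if t in drug_targets[drug])
        for name, s in samples.items()
    }
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

ranking_x = rank_samples("drug_X")
ranking_y = rank_samples("drug_Y")
```

Samples at the top of each ranking are the ones predicted to be most sensitive to that drug.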
5 Machine Learning Approaches
In cases where identification of response biomarkers is too complex
or the identified biomarkers do not reveal any underlying biological
insight, machine learning approaches, which can combine sequenc-
ing data with information such as biological networks, are very
powerful. The idea for employing machine learning approaches
for drug response prediction is for the computational algorithm
to learn how to combine a set of distinct features into a prediction
of sensitivity. Most machine learning methods for drug sensitivity
prediction are classified as supervised methods. Those supervised
methods use a set of sequenced samples with known drug sensitiv-
ities to “train” the algorithm and determine how to combine
features based on their predictive power (see Fig. 4). While the
linear regression model discussed earlier can be considered the
oldest form of machine learning, most popular methods currently
utilize more advanced modeling to account for the complexity in
genetic sequencing data. In fact, machine learning methods can
often detect higher order genomic markers of drug response that
other methods may have missed. One example is the use of machine
learning to identify the EWS-FLI1 translocation in Ewing’s sarcoma
as a marker of sensitivity to PARP inhibitors [69].
Fig. 4 Overview of how common machine-learning methods combine multiple data types to train a specific
model that can be applied to new samples to predict sensitivity. Panels: (1) Training, in which genomic data
(gene expression levels per sample), drug sensitivities, and optionally biological networks are used to fit a
machine learning model; (2) Prediction, in which the model converts genomic data for new samples into
sensitivity predictions
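The train-then-predict workflow of supervised learning can be illustrated with a deliberately small example. k-nearest-neighbor regression is used here only because it is a simple nonlinear learner that fits in a few lines; it is not a method named in this chapter, and the expression values and log IC50 measurements are invented. In the toy data, sensitivity depends on the combination of two genes rather than on either gene alone, a pattern a strictly linear model of the individual genes could not capture.

```python
import math

# Training set: (expression of two hypothetical genes), measured log IC50.
# Low log IC50 (sensitive) occurs only when exactly one gene is high.
training = [
    ((8.0, 2.0), -1.2), ((7.5, 2.5), -1.0),
    ((2.0, 8.0), -0.9), ((2.5, 7.0), -1.1),
    ((8.0, 8.0), 1.0), ((7.0, 7.5), 1.2),
    ((2.0, 2.0), 0.9), ((2.5, 3.0), 1.1),
]

def knn_predict(features, training_set, k=2):
    """Average the measured response of the k training samples whose
    expression profiles are closest (Euclidean distance) to `features`."""
    nearest = sorted(training_set,
                     key=lambda row: math.dist(features, row[0]))[:k]
    return sum(response for _, response in nearest) / k

# Prediction step for two new, unlabeled samples
predicted_sensitive = knn_predict((7.8, 2.2), training)
predicted_resistant = knn_predict((7.5, 7.8), training)
```

Published methods typically use many more features and samples, hold out data to estimate accuracy, and may add prior knowledge such as pathway annotations.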
Many methods seek to improve their performance by including
additional information on known biological networks, genetic
interactions, or drug chemical properties. For instance, Menden
et al. [70] found that including drug chemical information (such as
weight and lipophilicity) with sequencing data improved the per-
formance of both a neural network and random forest for sensitivity
prediction. In collaboration with the NCI, the Dialogue on Reverse
Engineering Assessment and Methods (DREAM) project led a
community effort to improve drug sensitivity predictions
[71]. Through this effort, the NCI-DREAM consortium publicly
released drug sensitivity data for a set of breast cancer cell lines
along with thorough genetic, epigenetic, and proteomic sequenc-
ing data. Individual groups each submitted different sensitivity
prediction methods and the NCI-DREAM consortium analyzed
each method to identify any particular method features that led to
higher accuracies. Interestingly, they found that the inclusion of
annotated biological pathways was one of the two variables that
significantly boosted performance [71]. Additionally, the consor-
tium found that the top performing methods all utilized nonlinear
modeling, indicating that in many cases the connections between
individual genetic features and drug response are too complex to be
understood using a strictly linear approach. Finally, they observed
that though sensitivity to proteasome inhibitors tended to be pre-
dicted with the most accuracy, there was a predictive signal for most
of the drugs in their test set. This further indicated that machine
learning methods have the potential to significantly improve
sequencing-based drug response prediction.
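As a minimal sketch of the training and prediction steps in Fig. 4, the following example concatenates genomic features with drug chemical descriptors, in the spirit of Menden et al. [70], and predicts the sensitivity of an unscreened cell line. All names, feature values, and the nearest-neighbor regressor are assumptions chosen for illustration; the cited studies used neural networks and random forests on real screening data.

```python
import math

# Hypothetical toy data: all names and values are illustrative only.
# Genomic features per cell line (e.g., scaled expression of two genes).
cell_features = {
    "line_A": [0.9, 0.1],
    "line_B": [0.2, 0.8],
    "line_C": [0.85, 0.2],  # new sample with no measured sensitivities
}
# Chemical descriptors per drug (e.g., scaled weight and lipophilicity).
drug_features = {
    "drug_X": [0.3, 0.7],
    "drug_Y": [0.8, 0.2],
}
# Measured sensitivities (e.g., scaled log IC50) for the training pairs.
training_sensitivity = {
    ("line_A", "drug_X"): 0.10,
    ("line_A", "drug_Y"): 0.90,
    ("line_B", "drug_X"): 0.80,
    ("line_B", "drug_Y"): 0.30,
}

def pair_vector(cell, drug):
    """Concatenate genomic and chemical features for one (cell, drug) pair."""
    return cell_features[cell] + drug_features[drug]

def predict(cell, drug, k=1):
    """k-nearest-neighbor regression, used here as a simple nonlinear model."""
    query = pair_vector(cell, drug)
    scored = sorted(
        (math.dist(query, pair_vector(c, d)), y)
        for (c, d), y in training_sensitivity.items()
    )
    nearest = scored[:k]
    return sum(y for _, y in nearest) / len(nearest)

# line_C was never screened, but it resembles line_A genomically.
prediction = predict("line_C", "drug_X")
```

Because each training example is a (cell line, drug) pair, a single model of this kind can in principle be queried for drugs and samples in any combination, which is the design idea behind adding chemical descriptors.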
6 Conclusion and Outlook
In the past two decades, there have been significant advances in
using genomic data and bioinformatics to better understand the
heterogeneous nature of drug response. By combining data on
genomic alterations and drug response with thorough statistical
methods, we can identify specific predictive markers. Moreover,
through post-treatment genomic profiling we can gain a better
understanding of the mechanism and effect of a given drug. This
knowledge can then be used to better select patients or diseases
where that mechanism will provide the most therapeutic benefit.
Additionally, computational methods have recently emerged to
identify drug targets when conventional approaches fail. However,
as the amount of data generated continues to increase and drugs
targeting new pathways are developed, we expect that no single
approach or method will provide high enough accuracy. Therefore,
we expect the field to move toward machine learning strategies
that can integrate a variety of data types into a single predictive
output. Sophisticated methods for this purpose are already being
created, and we anticipate that they will continue to improve over
the coming years. Altogether, we believe that adopting the
methodology described in this chapter not only has the power to
expand our understanding of pharmacology but can also significantly
improve the current schema of patient treatment.
Acknowledgments
The authors would like to thank the Elemento Lab members and
Natalie R. Davidson for their feedback and discussion. O.E. and
N.M. are supported by the CAREER grant from the National Science
Foundation (DB1054964), NIH grant R01CA194547, the Starr
Cancer Foundation, as well as by startup funds from the Institute
for Computational Biomedicine. Support for N.M. was also
provided by the PhRMA Foundation Pre Doctoral Informatics
Fellowship and by the Tri-Institutional Training Program in
Computational Biology and Medicine.
References
1. Fry RC, Svensson JP, Valiathan C, Wang E, Hogan BJ, Bhattacharya S, Bugni JM, Whittaker CA, Samson LD (2008) Genomic predictors of interindividual differences in response to DNA damaging agents. Genes Dev 22(19):2621–2626. https://doi.org/10.1101/gad.1688508
2. Rice SD, Heinzman JM, Brower SL, Ervin PR, Song N, Shen K, Wang DK (2010) Analysis of chemotherapeutic response heterogeneity and drug clustering based on mechanism of action using an in vitro assay. Anticancer Res 30(7):2805–2811
3. Bosquet JG, Marchion DC, Chon H, Lancaster JM, Chanock S (2014) Analysis of chemotherapeutic response in ovarian cancers using publicly available high-throughput data. Cancer Res 74(14):3902–3912. https://doi.org/10.1158/0008-5472.CAN-14-0186
4. Sboner A, Elemento O (2016) A primer on precision medicine informatics. Brief Bioinform 17(1):145–153. https://doi.org/10.1093/bib/bbv032
5. Shoemaker RH (2006) The NCI60 human tumour cell line anticancer drug screen. Nat Rev Cancer 6(10):813–823. https://doi.org/10.1038/nrc1951
6. Abaan OD, Polley EC, Davis SR, Zhu YJ, Bilke S, Walker RL, Pineda M, Gindin Y, Jiang Y, Reinhold WC, Holbeck SL, Simon RM, Doroshow JH, Pommier Y, Meltzer PS (2013) The exomes of the NCI-60 panel: a genomic resource for cancer biology and systems pharmacology. Cancer Res 73(14):4372–4382. https://doi.org/10.1158/0008-5472.Can-12-3342
7. Reinhold WC, Varma S, Sousa F, Sunshine M, Abaan OD, Davis SR, Reinhold SW, Kohn KW, Morris J, Meltzer PS, Doroshow JH, Pommier Y (2014) NCI-60 whole exome sequencing and pharmacological CellMiner analyses. PLoS One 9(7). https://doi.org/10.1371/journal.pone.0101670
8. Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB, Sausville EA, Pommier Y, Botstein D, Brown PO, Weinstein JN (2000) A gene expression database for the molecular pharmacology of cancer. Nat Genet 24(3):236–244. https://doi.org/10.1038/73439
9. Gholami AM, Hahne H, Wu ZX, Auer FJ, Meng C, Wilhelm M, Kuster B (2013) Global proteome analysis of the NCI-60 cell line panel. Cell Rep 4(3):609–620. https://doi.org/10.1016/j.celrep.2013.07.018
10. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, Wilson CJ, Lehar J, Kryukov GV, Sonkin D, Reddy A, Liu M, Murray L, Berger MF, Monahan JE, Morais P, Meltzer J, Korejwa A, Jane-Valbuena J, Mapa FA, Thibault J, Bric-Furlong E, Raman P, Shipway A, Engels IH, Cheng J, Yu GK, Yu JJ, Aspesi P, de Silva M, Jagtap K, Jones MD, Wang L, Hatton C, Palescandolo E, Gupta S, Mahan S, Sougnez C, Onofrio RC, Liefeld T, MacConaill L, Winckler W, Reich M, Li NX, Mesirov JP, Gabriel SB, Getz G, Ardlie K, Chan V, Myer VE, Weber BL, Porter J, Warmuth M, Finan P, Harris JL, Meyerson M, Golub TR, Morrissey MP, Sellers WR, Schlegel R, Garraway LA (2012) The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity (483:603, 2012). Nature 492(7428):290–290. https://doi.org/10.1038/nature11735
11. Stransky N, Ghandi M, Kryukov GV, Garraway LA, Lehar J, Liu M, Sonkin D, Kauffmann A, Venkatesan K, Edelman EJ, Riester M, Barretina J, Caponigro G, Schlegel R, Sellers WR, Stegmeier F, Morrissey M, Amzallag A, Pruteanu-Malinici I, Haber DA, Ramaswamy S, Benes CH, Menden MP, Iorio F, Stratton MR, McDermott U, Garnett MJ, Saez-Rodriguez J, Canc DS, Line CC, Inst B, Res NIB, Sensitivity GD, Hosp MG, Lab EMB, Inst EB, Inst WTS (2015) Pharmacogenomic agreement between two cancer cell line data sets. Nature 528(7580):84. https://doi.org/10.1038/nature15736
12. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN, Reich M, Hieronymus H, Wei G, Armstrong SA, Haggarty SJ, Clemons PA, Wei R, Carr SA, Lander ES, Golub TR (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313(5795):1929–1935. https://doi.org/10.1126/science.1132939
13. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai HY, He YDD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH (2000) Functional discovery via a compendium of expression profiles. Cell 102(1):109–126. https://doi.org/10.1016/S0092-8674(00)00015-5
14. Gayvert KM, Dardenne E, Cheung C, Boland MR, Lorberbaum T, Wanjala J, Chen Y, Rubin MA, Tatonetti NP, Rickman DS, Elemento O (2016) A computational drug repositioning approach for targeting oncogenic transcription factors. Cell Rep 15(11):2348–2356. https://doi.org/10.1016/j.celrep.2016.05.037
15. Dudley JT, Deshpande T, Butte AJ (2011) Exploiting drug-disease relationships for computational drug repositioning. Brief Bioinform 12(4):303–311. https://doi.org/10.1093/bib/bbr013
16. Low SK, Takahashi A, Mushiroda T, Kubo M (2014) Genome-wide association study: a useful tool to identify common genetic variants associated with drug toxicity and efficacy in cancer pharmacogenomics. Clin Cancer Res 20(10):2541–2552. https://doi.org/10.1158/1078-0432.Ccr-13-2755
17. Zhou KX, Pearson ER (2013) Insights from genome-wide association studies of drug response. Annu Rev Pharmacol 53:299–310. https://doi.org/10.1146/annurev-pharmtox-011112-140237
18. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9(5):356–369. https://doi.org/10.1038/nrg2344
19. Xu ZL, Taylor JA (2009) SNPinfo: integrating GWAS and candidate gene information into functional SNP selection for genetic association studies. Nucleic Acids Res 37:W600–W605. https://doi.org/10.1093/nar/gkp290
20. McKinney BA, Pajewski NM (2011) Six degrees of epistasis: statistical network models for GWAS. Front Genet 2:109. https://doi.org/10.3389/fgene.2011.00109
21. Howie B, Marchini J, Stephens M (2011) Genotype imputation with thousands of genomes. G3 (Bethesda) 1(6):457–470. https://doi.org/10.1534/g3.111.001198
22. Byun E, Caillier SJ, Montalban X, Villoslada P, Fernandez O, Brassat D, Comabella M, Wang J, Barcellos LF, Baranzini SE, Oksenberg JR (2008) Genome-wide pharmacogenomic analysis of the response to interferon beta therapy in multiple sclerosis. Arch Neurol Chicago 65(3):337–E332. https://doi.org/10.1001/archneurol.2008.47
23. Liu CY, Batliwalla F, Li WT, Lee A, Roubenoff R, Beckman E, Khalili H, Damle A, Kern M, Furie R, Dupuis J, Plenge RM, Coenen MJH, Behrens TW, Carulli JP, Gregersen PK (2008) Genome-wide association scan identifies candidate polymorphisms associated with differential response to anti-TNF treatment in rheumatoid arthritis. Mol Med 14(9-10):575–581. https://doi.org/10.2119/2008-00056.Liu
24. Link E, Parish S, Armitage J, Bowman L, Heath S, Matsuda F, Gut I, Lathrop M, Collins R, Grp SC (2008) SLCO1B1 variants and statin-induced myopathy – a genomewide study. New Engl J Med 359(8):789–799
25. Bush WS, Moore JH (2012) Chapter 11: genome-wide association studies. PLoS Comput Biol 8(12). https://doi.org/10.1371/journal.pcbi.1002822
26. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575. https://doi.org/10.1086/519795
27. Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Duncanson A, Kwiatkowski DP, McCarthy MI, Ouwehand WH, Samani NJ, Todd JA, Donnelly P, Barrett JC, Davison D, Easton D, Evans D, Leung HT, Marchini JL, Morris AP, Spencer CCA, Tobin MD, Attwood AP, Boorman JP, Cant B, Everson U, Hussey JM, Jolley JD, Knight AS, Koch K, Meech E, Nutland S, Prowse CV, Stevens HE, Taylor NC, Walters GR, Walker NM, Watkins NA, Winzer T, Jones RW, McArdle WL, Ring SM, Strachan DP, Pembrey M, Breen G, St Clair D, Caesar S, Gordon-Smith K, Jones L, Fraser C, Green EK, Grozeva D, Hamshere ML, Holmans PA, Jones IR, Kirov G, Moskvina V, Nikolov I, O’Donovan MC, Owen MJ, Collier DA, Elkin A, Farmer A, Williamson R, McGuffin P, Young AH, Ferrier IN, Ball SG, Balmforth AJ, Barrett JH, Bishop DT, Iles MM, Maqbool A, Yuldasheva N, Hall AS, Braund PS, Dixon RJ, Mangino M, Stevens S, Thompson JR, Bredin F, Tremelling M, Parkes M, Drummond H, Lees CW, Nimmo ER, Satsangi J, Fisher SA, Forbes A, Lewis CM, Onnie CM, Prescott NJ, Sanderson J, Mathew CG, Barbour J, Mohiuddin MK, Todhunter CE, Mansfield JC, Ahmad T, Cummings FR, Jewell DP, Webster J, Brown MJ, Lathrop GM, Connell J, Dominiczak A, Marcano CAB, Burke B, Dobson R, Gungadoo J, Lee KL, Munroe PB, Newhouse SJ, Onipinla A, Wallace C, Xue MZ, Caulfield M, Farrall M, Barton A, Bruce IN, Donovan H, Eyre S, Gilbert PD, Hider SL, Hinks AM, John SL, Potter C, Silman AJ, Symmons DPM, Thomson W, Worthington J, Dunger DB, Widmer B, Frayling TM, Freathy RM, Lango H, Perry JRB, Shields BM, Weedon MN, Hattersley AT, Hitman GA, Walker M, Elliott KS, Groves CJ, Lindgren CM, Rayner NW, Timpson NJ, Zeggini E, Newport M, Sirugo G, Lyons E, Vannberg F, Brown MA, Franklyn JA, Heward JM, Simmonds MJ, Hill AVS, Bradbury LA, Farrar C, Pointon JJ, Wordsmith P, Gough SCL, Seal S, Stratton MR, Rahman N, Ban M, Goris A, Sawcer SJ, Compston A, Conway D, Jallow M, Bumpstead SJ, Chaney A, Downes K, Ghori MJR, Gwilliam R, Inouye M, Keniry A, King E, McGinnis R, Potter S, Ravindrarajah R, Whittaker P, Withers D, Easton D, Pereira-Gale J, Hallgrimsdottir IB, Howie BN, Su Z, Teo YY, Vukcevic D, Bentley D, Caulfield M, Mathew CG, Worthington J, Consortium WTCC, Syndicate BRGGS, Collaborat BCS (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447(7145):661–678. https://doi.org/10.1038/nature05911
28. Gamazon ER, Huang RS, Cox NJ, Dolan ME (2010) Chemotherapeutic drug susceptibility associated SNPs are enriched in expression quantitative trait loci. Proc Natl Acad Sci U S A 107(20):9287–9292. https://doi.org/10.1073/pnas.1001827107
29. Chung CM, Wang RY, Chen JW, Fann CSJ, Leu HB, Ho HY, Ting CT, Lin TH, Sheu SH, Tsai WC, Chen JH, Jong YS, Lin SJ, Chen YT, Pan WH (2010) A genome-wide association study identifies new loci for ACE activity: potential implications for response to ACE inhibitor. Pharmacogenomics J 10(6):537–544. https://doi.org/10.1038/tpj.2009.70
30. Diboun I, Wernisch L, Orengo CA, Koltzenburg M (2006) Microarray analysis after RNA amplification can detect pronounced differences in gene expression using limma. BMC Genomics 7. https://doi.org/10.1186/1471-2164-7-252
31. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43(7):e47. https://doi.org/10.1093/nar/gkv007
32. Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol 11(10). https://doi.org/10.1186/gb-2010-11-10-r106
33. Robinson MD, McCarthy DJ, Smyth GK (2010) edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140. https://doi.org/10.1093/bioinformatics/btp616
34. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7(3):562–578. https://doi.org/10.1038/nprot.2012.016
35. Wright G, Tan B, Rosenwald A, Hurt EH, Wiestner A, Staudt LM (2003) A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci U S A 100(17):9991–9996. https://doi.org/10.1073/pnas.1732008100
36. Lam LT, Davis RE, Pierce J, Hepperle M, Xu Y, Hottelet M, Nong Y, Wen D, Adams J, Dang L, Staudt LM (2005) Small molecule inhibitors of IkappaB kinase are selectively toxic for subgroups of diffuse large B-cell lymphoma defined by gene expression profiling. Clin Cancer Res 11(1):28–40
37. Briones J (2009) Emerging therapies for B-cell non-Hodgkin lymphoma. Expert Rev Anticancer 9(9):1305–1316. https://doi.org/10.1586/Era.09.86
38. Jimeno A, Tan AC, Coffa J, Rajeshkumar NV, Kulesza P, Rubio-Viqueira B, Wheelhouse J, Diosdado B, Messersmith WA, Iacobuzio-Donahue C, Maitra A, Varella-Garcia M, Hirsch FR, Meijer GA, Hidalgo M (2008) Coordinated epidermal growth factor receptor pathway gene overexpression predicts epidermal growth factor receptor inhibitor sensitivity in pancreatic cancer. Cancer Res 68(8):2841–2849. https://doi.org/10.1158/0008-5472.Can-07-5200
39. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdottir H, Tamayo P, Mesirov JP (2011) Molecular signatures database (MSigDB) 3.0. Bioinformatics 27(12):1739–1740. https://doi.org/10.1093/bioinformatics/btr260
40. Fabregat A, Sidiropoulos K, Garapati P, Gillespie M, Hausmann K, Haw R, Jassal B, Jupe S, Korninger F, McKay S, Matthews L, May B, Milacic M, Rothfels K, Shamovsky V, Webber M, Weiser J, Williams M, Wu G, Stein L, Hermjakob H, D’Eustachio P (2016) The reactome pathway knowledgebase. Nucleic Acids Res 44(D1):D481–D487. https://doi.org/10.1093/nar/gkv1351
41. Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, Caudy M, Garapati P, Gillespie M, Kamdar MR, Jassal B, Jupe S, Matthews L, May B, Palatnik S, Rothfels K, Shamovsky V, Song H, Williams M, Birney E, Hermjakob H, Stein L, D’Eustachio P (2014) The reactome pathway knowledgebase. Nucleic Acids Res 42(Database issue):D472–D477. https://doi.org/10.1093/nar/gkt1102
42. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M (1999) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 27(1):29–34. https://doi.org/10.1093/nar/27.1.29
43. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet 25(1):25–29. https://doi.org/10.1038/75556
44. Breuer K, Foroushani AK, Laird MR, Chen C, Sribnaia A, Lo R, Winsor GL, Hancock RE, Brinkman FS, Lynn DJ (2013) InnateDB: systems biology of innate immunity and beyond—recent updates and continuing curation. Nucleic Acids Res 41(Database issue):D1228–D1233. https://doi.org/10.1093/nar/gks1147
45. Lynn DJ, Winsor GL, Chan C, Richard N, Laird MR, Barsky A, Gardy JL, Roche FM, Chan TH, Shah N, Lo R, Naseer M, Que J, Yau M, Acab M, Tulpan D, Whiteside MD, Chikatamarla A, Mah B, Munzner T, Hokamp K, Hancock RE, Brinkman FS (2008) InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol 4:218. https://doi.org/10.1038/msb.2008.55
46. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43):15545–15550. https://doi.org/10.1073/pnas.0506580102
47. Li HL, Gao ZT, Kang L, Zhang HL, Yang K, Yu KQ, Luo XM, Zhu WL, Chen KX, Shen JH, Wang XC, Jiang HL (2006) TarFisDock: a web server for identifying drug targets with docking approach. Nucleic Acids Res 34:W219–W224. https://doi.org/10.1093/nar/gkl114
48. Rarey M, Kramer B, Lengauer T, Klebe G (1996) A fast flexible docking method using an incremental construction algorithm. J Mol Biol 261(3):470–489. https://doi.org/10.1006/jmbi.1996.0477
49. Butina D, Segall MD, Frankcombe K (2002) Predicting ADME properties in silico: methods and models. Drug Discov Today 7(11):S83–S88. https://doi.org/10.1016/S1359-6446(02)02288-2
50. Nantasenamat C, Isarankura-Na-Ayudhya C, Prachayasittikul V (2010) Advances in computational methods to predict the biological activity of compounds. Expert Opin Drug Dis 5(7):633–654. https://doi.org/10.1517/17460441.2010.492827
51. Wang KJ, Sun JZ, Zhou SF, Wan CL, Qin SY, Li C, He L, Yang L (2013) Prediction of drug-target interactions for drug repositioning only based on genomic expression similarity. PLoS Comput Biol 9(11):e1003315. https://doi.org/10.1371/journal.pcbi.1003315
52. Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P (2008) Drug target identification using side-effect similarity. Science 321(5886):263–266. https://doi.org/10.1126/science.1158140
53. Madhukar NS, Huang L, Khade P, Gayvert K, Giannakakou P, Elemento O (2015) Abstract B162: small molecule target prediction and identification of novel anti-cancer compounds using a data-driven bayesian approach. Mol Cancer Ther 14(12 Supplement 2):B162. https://doi.org/10.1158/1535-7163.targ-15-b162
54. Li J, Wood WH, Becker KG, Weeraratna AT, Morin PJ (2007) Gene expression response to cisplatin treatment in drug-sensitive and drug-resistant ovarian cancer cells. Oncogene 26(20):2860–2872. https://doi.org/10.1038/sj.onc.1210086
55. Lamb J (2007) The connectivity map: a new tool for biomedical research. Nat Rev Cancer 7(1):54–60. https://doi.org/10.1038/nrc2044
56. Paez JG, Janne PA, Lee JC, Tracy S, Greulich H, Gabriel S, Herman P, Kaye FJ, Lindeman N, Boggon TJ, Naoki K, Sasaki H, Fujii Y, Eck MJ, Sellers WR, Johnson BE, Meyerson M (2004) EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 304(5676):1497–1500. https://doi.org/10.1126/science.1099314
57. Cappuzzo F, Bemis L, Varella-Garcia M (2006) HER2 mutation and response to trastuzumab therapy in non-small-cell lung cancer. N Engl J Med 354(24):2619–2621. https://doi.org/10.1056/NEJMc060020
58. Romond EH, Perez EA, Bryant J, Suman VJ, Geyer CE Jr, Davidson NE, Tan-Chiu E, Martino S, Paik S, Kaufman PA, Swain SM, Pisansky TM, Fehrenbacher L, Kutteh LA, Vogel VG, Visscher DW, Yothers G, Jenkins RB, Brown AM, Dakhil SR, Mamounas EP, Lingle WL, Klein PM, Ingle JN, Wolmark N (2005) Trastuzumab plus adjuvant chemotherapy for operable HER2-positive breast cancer. N Engl J Med 353(16):1673–1684. https://doi.org/10.1056/NEJMoa052122
59. Young K, Minchom A, Larkin J (2012) BRIM-1, -2 and -3 trials: improved survival with vemurafenib in metastatic melanoma patients with a BRAF(V600E) mutation. Future Oncol 8(5):499–507. https://doi.org/10.2217/fon.12.43
60. Chapman PB, Hauschild A, Robert C, Haanen JB, Ascierto P, Larkin J, Dummer R, Garbe C, Testori A, Maio M, Hogg D, Lorigan P, Lebbe C, Jouary T, Schadendorf D, Ribas A, O’Day SJ, Sosman JA, Kirkwood JM, Eggermont AM, Dreno B, Nolop K, Li J, Nelson B, Hou J, Lee RJ, Flaherty KT, GA MA, Group B-S (2011) Improved survival with vemurafenib in melanoma with BRAF V600E mutation. N Engl J Med 364(26):2507–2516. https://doi.org/10.1056/NEJMoa1103782
61. Bollag G, Tsai J, Zhang J, Zhang C, Ibrahim P, Nolop K, Hirth P (2012) Vemurafenib: the first drug approved for BRAF-mutant cancer. Nat Rev Drug Discov 11(11):873–886. https://doi.org/10.1038/nrd3847
62. Vogel CL, Cobleigh MA, Tripathy D, Gutheil JC, Harris LN, Fehrenbacher L, Slamon DJ, Murphy M, Novotny WF, Burchmore M, Shak S, Stewart SJ, Press M (2002) Efficacy and safety of trastuzumab as a single agent in first-line treatment of HER2-overexpressing metastatic breast cancer. J Clin Oncol 20(3):719–726. https://doi.org/10.1200/JCO.2002.20.3.719
63. Brennan CW, Verhaak RGW, McKenna A, Campos B, Noushmehr H, Salama SR, Zheng SY, Chakravarty D, Sanborn JZ, Berman SH, Beroukhim R, Bernard B, Wu CJ, Genovese G, Shmulevich I, Barnholtz-Sloan J, Zou LH, Vegesna R, Shukla SA, Ciriello G, Yung WK, Zhang W, Sougnez C, Mikkelsen T, Aldape K, Bigner DD, Van Meir EG, Prados M, Sloan A, Black KL, Eschbacher J, Finocchiaro G, Friedman W, Andrews DW, Guha A, Iacocca M, O’Neill BP, Foltz G, Myers J, Weisenberger DJ, Penny R, Kucherlapati R, Perou CM, Hayes DN, Gibbs R, Marra M, Mills GB, Lander E, Spellman P, Wilson R, Sander C, Weinstein J, Meyerson M, Gabriel S, Laird PW, Haussler D, Getz G, Chin L, Network TR (2013) The somatic genomic landscape of glioblastoma. Cell 155(2):462–477. https://doi.org/10.1016/j.cell.2013.09.034
64. Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, Foster B, Moser M, Karasik E, Gillard B, Ramsey K, Sullivan S, Bridge J, Magazine H, Syron J, Fleming J, Siminoff L, Traino H, Mosavel M, Barker L, Jewell S, Rohrer D, Maxim D, Filkins D, Harbach P, Cortadillo E, Berghuis B, Turner L, Hudson E, Feenstra K, Sobin L, Robb J, Branton P, Korzeniewski G, Shive C, Tabor D, Qi LQ, Groch K, Nampally S, Buia S, Zimmerman A, Smith A, Burges R, Robinson K, Valentino K, Bradbury D, Cosentino M, Diaz-Mayoral N, Kennedy M, Engel T, Williams P, Erickson K, Ardlie K, Winckler W, Getz G, DeLuca D, MacArthur D, Kellis M, Thomson A, Young T, Gelfand E, Donovan M, Meng Y, Grant G, Mash D, Marcus Y, Basile M, Liu J, Zhu J, Tu ZD, Cox NJ, Nicolae DL, Gamazon ER, Im HK, Konkashbaev A, Pritchard J, Stevens M, Flutre T, Wen XQ, Dermitzakis ET, Lappalainen T, Guigo R, Monlong J, Sammeth M, Koller D, Battle A, Mostafavi S, McCarthy M, Rivas M, Maller J, Rusyn I, Nobel A, Wright F, Shabalin A, Feolo M, Sharopova N, Sturcke A, Paschal J, Anderson JM, Wilder EL, Derr LK, Green ED, Struewing JP, Temple G, Volpi S, Boyer JT, Thomson EJ, Guyer MS, Ng C, Abdallah A, Colantuoni D, Insel TR, Koester SE, Little AR, Bender PK, Lehner T, Yao Y, Compton CC, Vaught JB, Sawyer S, Lockhart NC, Demchok J, Moore HF (2013) The genotype-tissue expression (GTEx) project. Nat Genet 45(6):580–585. https://doi.org/10.1038/ng.2653
65. Sachlos E, Risueno RM, Laronde S, Shapovalova Z, Lee JH, Russell J, Malig M, McNicol JD, Fiebig-Comyn A, Graham M, Levadoux-Martin M, Lee JB, Giacomelli AO, Hassell JA, Fischer-Russell D, Trus MR, Foley R, Leber B, Xenocostas A, Brown ED, Collins TJ, Bhatia M (2012) Identification of drugs including a dopamine receptor antagonist that selectively target cancer stem cells. Cell 149(6):1284–1297. https://doi.org/10.1016/j.cell.2012.03.049
66. Madhukar NS, Elemento O, Pandey G (2015) Prediction of genetic interactions using machine learning and network properties. Front Bioeng Biotechnol 3(172). https://doi.org/10.3389/fbioe.2015.00172
67. Chan DA, Giaccia AJ (2011) Harnessing synthetic lethal interactions in anticancer drug discovery. Nat Rev Drug Discov 10(5):351–364. https://doi.org/10.1038/nrd3374
68. Jerby-Arnon L, Pfetzer N, Waldman YY, McGarry L, James D, Shanks E, Seashore-Ludlow B, Weinstock A, Geiger T, Clemons PA, Gottlieb E, Ruppin E (2014) Predicting cancer-specific vulnerability via data-driven detection of synthetic lethality. Cell 158(5):1199–1209. https://doi.org/10.1016/j.cell.2014.07.027
69. Garnett MJ, Edelman EJ, Heidorn SJ, Greenman CD, Dastur A, Lau KW, Greninger P, Thompson IR, Luo X, Soares J, Liu Q, Iorio F, Surdez D, Chen L, Milano RJ, Bignell GR, Tam AT, Davies H, Stevenson JA, Barthorpe S, Lutz SR, Kogera F, Lawrence K, McLaren-Douglas A, Mitropoulos X, Mironenko T, Thi H, Richardson L, Zhou W, Jewitt F, Zhang T, O’Brien P, Boisvert JL, Price S, Hur W, Yang W, Deng X, Butler A, Choi HG, Chang JW, Baselga J, Stamenkovic I, Engelman JA, Sharma SV, Delattre O, Saez-Rodriguez J, Gray NS, Settleman J, Futreal PA, Haber DA, Stratton MR, Ramaswamy S, McDermott U, Benes CH (2012) Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 483(7391):570–575. https://doi.org/10.1038/nature11005
70. Menden MP, Iorio F, Garnett M, McDermott U, Benes CH, Ballester PJ, Saez-Rodriguez J (2013) Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS One 8(4). https://doi.org/10.1371/journal.pone.0061318
71. Costello JC, Heiser LM, Georgii E, Gonen M, Menden MP, Wang NJ, Bansal M, Ammad-ud-din M, Hintsanen P, Khan SA, Mpindi JP, Kallioniemi O, Honkela A, Aittokallio T, Wennerberg K, Collins JJ, Gallahan D, Singer D, Saez-Rodriguez J, Kaski S, Gray JW, Stolovitzky G, Community ND (2014) A community effort to assess and improve drug sensitivity prediction algorithms. Nat Biotechnol 32(12):1202–U1257. https://doi.org/10.1038/nbt.2877
Chapter 15
A Robust Optimization Approach to Cancer Treatment under Toxicity Uncertainty
Junfeng Zhu, Hamidreza Badri, and Kevin Leder
Abstract
The design of optimal protocols plays an important role in cancer treatment. However, in clinical applica-
tions, the outcomes under the optimal protocols are sensitive to variations of parameter settings such as
drug effects and the attributes of age, weight, and health conditions in human subjects. One approach to
overcoming this challenge is to formulate the problem of finding an optimal treatment protocol as a robust
optimization problem (ROP) that takes parameter uncertainty into account. In this chapter, we describe a
method to model toxicity uncertainty. We then apply a mixed integer ROP to derive the optimal protocols
that minimize the cumulative tumor size. While our method may be applied to other cancers, in this work
we focus on the treatment of chronic myeloid leukemia (CML) with tyrosine kinase inhibitors (TKIs). For
simplicity, we consider one particular mode of toxicity arising from TKI therapy: low blood cell counts,
specifically low absolute neutrophil count (ANC). We develop optimization methods for locating optimal
treatment protocols assuming that the rate of decrease of ANC varies within a given interval. We further
investigate the relationship between parameter uncertainty and optimal protocols. Our results suggest
that the optimized dosing schedule can significantly reduce tumor size without recurrence in 360 weeks while
ensuring that toxicity constraints are satisfied for all realizations of uncertain parameters.
Key words Robust optimization, Mixed integer optimization, Cancer treatment, Toxicity uncertainty
1 Introduction
An important problem in the study of cancer is the development of
resistance to anti-cancer therapies. In particular, resistance-mediated
treatment failure has been a problem for several blockbuster
anti-cancer therapies [1, 2]. The problem of therapy resistance
has been extensively studied from the perspective of
evolutionary biology [3]. For example, in [4], the authors devel-
oped a stochastic model with experimental data to study the likeli-
hood, composition, and diversity of pre-existing resistance. Their
results show that there is at most one resistant clone present at the
time of diagnosis for most patients. In another work [5], the
authors constructed a stochastic model to study the timing of
resistance-mediated treatment failure. They found that in the
setting of treatment of non-small cell lung cancer with the targeted
therapy Tarceva (erlotinib), treatment may be discontinued too
early.
Mathematical models of cancer evolution during treatment
have the potential to be very useful in the creation of optimal
treatment schedules. If one can construct computationally tractable
mathematical models of cancer evolution under treatment, then it
is possible to compare various treatment regimens and thereby
search for the most effective regimen. A significant hurdle in the
use of such models is parameter variability and uncertainty. In
particular, one may have a computationally tractable mathematical
model for tumor evolution during treatment, but finding an opti-
mal treatment regimen for a patient requires knowing the model
parameters for that patient. One possible solution to this problem is
to estimate model parameters for a specific patient [6]. However,
this approach is often hindered by a lack of sequential tumor size
data for individual patients. An alternative approach is developing
optimal treatment schedules that are robust to uncertainty in model
parameters. In [7] we developed an approach for optimizing radia-
tion therapy schedules in the presence of uncertain model
parameters.
An exciting advance of the past 15 years of cancer medicine
is the development of new small molecule pharmaceutical
agents that specifically target cancer cells [8, 9]. One stunning
success has been in the treatment of chronic myeloid leukemia
(CML) with the tyrosine kinase inhibitor (TKI) imatinib
[10]. Since the launch of imatinib, several other TKIs have been
developed that are effective in the treatment of CML, e.g.,
dasatinib and nilotinib [11, 12]. In general, these drugs target the
fusion protein BCR-ABL, which drives the unchecked proliferation
of CML cells [43]. While TKI therapy has been largely successful, a
fraction of patients experience treatment failure due to the
evolution of mutated cancer cells that are resistant to the TKI
with which they have been treated. For example, in [13] researchers
reported that the failure rate at 60 months for patients receiving
imatinib was 17%. One possible method for reducing the risk of this
evolved resistance is to treat patients with a variety of TKIs,
thereby reducing the risk of treatment failure due to a cell that is
resistant to a specific TKI.
This leads to a challenging optimization problem where the
goal is to decide on a sequence of TKI therapies that maximizes
patient outcomes. In our earlier work [14] we considered this
problem and worked with a mathematical model to study the
evolution of CML and normal blood cells under treatment with a
variety of TKIs. A potential roadblock for clinical implementation
of our prior work is that model parameters are difficult to estimate
accurately. Therefore, in the current work we consider an extension
of [14] by allowing for uncertainty in model parameters.
The structure of the chapter is as follows. In Subheading 2,
we review the general literature on optimization of cancer therapy
in continuous time. In Subheading 3, we present our methodology
for solving cancer treatment optimization problems with uncertain
toxicity response, in the context of treating CML with multiple
TKIs. In Subheading 4, we present numerical results
in the setting of CML. We conclude the chapter by summarizing
our innovative methodology and providing insight into clinical
management.
2 General Models
The general statement of an optimization problem in cancer therapy
consists of the objective function, the control system of cell
dynamics, and the toxicity constraints. Optimal control theory is
widely used to design treatment protocols. The
general form for a continuous-time cancer optimization problem
can be described as follows:
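$$
\begin{aligned}
\min\ & J && \text{(1a)}\\
\text{s.t.}\ & \dot{x}(t) = f(x, y) && \text{(1b)}\\
& \dot{y}(t) = g(x, y) && \text{(1c)}\\
& \hat{f}(x(t), y(t)) \le 0 && \text{(1d)}\\
& \tilde{f}(x(t), y(t)) = 0 && \text{(1e)}\\
& x_{\min} \le x(t) \le x_{\max} && \text{(1f)}\\
& y_{\min} \le y(t) \le y_{\max} && \text{(1g)}
\end{aligned}
$$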
where J is the objective function and is determined by the intended
outcome of the therapy, $x(t) = (x_1(t), x_2(t), \ldots, x_{n-1}(t), x_n(t))$
is the state vector, which represents the populations of
n different types of cells at time t, e.g., normal, wild type, or mutant
cells, and $y(t) = (y_1(t), y_2(t), \ldots, y_{l-1}(t), y_l(t))$ is the control
vector, which represents the l control types such as drug dosages,
treatment methods (e.g., chemotherapy, radiation therapy, TKIs), or
which drug will be applied during the treatment. The equation
$\dot{x}(t) = f(x, y)$ is a differential equation governing the cell dynamics.
The equation $\dot{y}(t) = g(x, y)$ is a differential equation that governs
the drug levels in the system as a function of cell population, i.e.,
the relationship between drug dosage and tumor sizes with respect
to time t. Equations 1d and 1e indicate that x(t) and y(t) may be
constrained by inequality and equality constraints, and Eqs. 1f and
1g give the lower and upper bounds for x(t) and y(t),
respectively.
2.1 Objective Functions
The role of the objective function in Eq. 1 is to specify the desired
outcome of the course of anti-cancer therapy. The simplest form of
an objective function is to minimize the tumor population at the
end of treatment [15], i.e.,
$$ J = C(T) \qquad \text{(2)} $$
where C(T) is the tumor cell population at time T and T is a given
constant parameter indicating the length of the treatment period.
Although objective functions of the form (Eq. 2) are easy to imple-
ment, they suffer from the drawback that they allow for large tumor
populations during treatment. To deal with this shortcoming,
Murray et al. [16] minimized the total tumor cell population over
the interval [0, T] while limiting the side effects of therapy. In
particular, they consider the objective function
$$ J = \int_0^T \left( \alpha_1 C(t) + \alpha_2 S_e(t) \right) dt $$
where $S_e(t)$ is a function modeling side effects. It can be a function
of dosage [17] or of loss of body weight [18], and α1 and α2 are
weighting values for the cumulative tumor population and normal
tissue toxicity, respectively. Note that if one sets the parameter
α2 to zero, then the goal is to minimize the cumulative tumor
population over the time frame [0, T].
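The difference between the terminal objective (Eq. 2) and the cumulative objective can be seen on sampled trajectories. The following is a minimal sketch, assuming weekly samples and a trapezoidal approximation of the integral; the function names and the two tumor trajectories are hypothetical and are not taken from [15] or [16].

```python
def terminal_objective(tumor):
    """Eq. 2: J = C(T), the tumor burden at the last sample."""
    return tumor[-1]


def cumulative_objective(tumor, side_effect, dt, alpha1=1.0, alpha2=0.0):
    """Trapezoidal estimate of the integral of alpha1*C(t) + alpha2*Se(t) over [0, T]."""
    integrand = [alpha1 * c + alpha2 * s for c, s in zip(tumor, side_effect)]
    return sum(0.5 * (a + b) * dt for a, b in zip(integrand, integrand[1:]))


# Hypothetical weekly tumor sizes: both courses end at the same burden.
steady = [100.0, 80.0, 60.0, 40.0, 20.0]
late_drop = [100.0, 150.0, 200.0, 150.0, 20.0]
no_side_effects = [0.0] * 5

j_end = (terminal_objective(steady), terminal_objective(late_drop))
j_cum = (
    cumulative_objective(steady, no_side_effects, dt=1.0),
    cumulative_objective(late_drop, no_side_effects, dt=1.0),
)
print(j_end, j_cum)
```

Both trajectories are indistinguishable under the terminal objective, while the cumulative objective penalizes the course that lets the tumor grow during treatment.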
2.2 Tumor Growth Models
Most optimization models of cancer therapy assume that tumor
growth can be accurately modeled by a set of differential equations
(usually ordinary differential equations). Some important questions
to consider when building these kinds of models are how the tumor
cells grow, how they interact, and how they are affected by anti-
cancer therapy. The simplest tumor growth model assumes that all
tumor cells proliferate with constant cell cycle duration which
results in an exponential growth model:
$$ \dot{x}(t) = \lambda x(t) $$
where x(t) is the tumor size at time t, and λ is a constant related to
the net-growth rate of the tumor. By using a single parameter, an
exponential growth model can capture some key features of the
beginning phase of tumor growth. However, the prediction of
tumor size based on the exponential growth model does not
match well with clinical datasets, since the exponential model will
give unreasonably large values over a long time. In particular,
limited nutrient availability for large tumors makes the exponential
growth an inappropriate model for tumor growth [19]. To over-
come this drawback researchers often use models such as logistic or
Gompertz models, where the growth rate decays as the tumor
population increases [19]. Thus, as t increases, tumor size
converges to a maximal volume, the so-called carrying capacity,

denoted by K. The logistic growth model is defined based on a
linear reduction in the tumor growth which is proportional to the
tumor size [20]:

$$ \dot{x}(t) = \lambda x(t) \left( 1 - \frac{x(t)}{K} \right) $$
Like the logistic growth model, the Gompertz growth model

assumes that decreasing growth rate is due to competition for the
nutrients in a more densely populated tumor
$$ \dot{x}(t) = \lambda x(t) \ln\left( \frac{K}{x(t)} \right) \qquad \text{(3)} $$
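The three growth laws can be compared directly through their closed-form solutions. The sketch below assumes the standard analytic solutions of the exponential, logistic, and Gompertz equations; the initial size, growth rate, and carrying capacity are hypothetical illustration values.

```python
import math


def exponential(t, x0, lam):
    """Solution of dx/dt = lam * x."""
    return x0 * math.exp(lam * t)


def logistic(t, x0, lam, K):
    """Solution of dx/dt = lam * x * (1 - x/K)."""
    return K / (1.0 + (K / x0 - 1.0) * math.exp(-lam * t))


def gompertz(t, x0, lam, K):
    """Solution of dx/dt = lam * x * ln(K/x)."""
    return K * math.exp(math.log(x0 / K) * math.exp(-lam * t))


# Hypothetical values: 1e6 cells initially, carrying capacity 1e9, lam = 0.1 per unit time.
x0, lam, K = 1e6, 0.1, 1e9
for t in (0, 50, 100, 500):
    print(t, exponential(t, x0, lam), logistic(t, x0, lam, K), gompertz(t, x0, lam, K))
```

For large t the logistic and Gompertz curves saturate at K, whereas the exponential curve grows without bound, which is the drawback discussed above.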
In [20, 21], the authors propose a modified Gompertz model,

which incorporates drug concentration. The dynamics of drug
concentration are modeled by the following equation:
$$ \dot{v}(t) = u(t) - \beta v(t) $$
where v(t) is the drug concentration at time t and u(t) is a piecewise
continuous function of time that gives the rate of drug infusion.
The drug concentration falls by βv(t) dt over a time interval dt. The
authors assume that the net growth of a tumor cell population
comes from two sources: tumor growth due to cell proliferation
and tumor shrinkage due to drug administration. The tumor
growth is modeled by a general Gompertz model, i.e., Eq. 3. For
modeling cell death, they make two more assumptions: the cell kill
is proportional to the tumor size x(t), and tumor killing stops if the
drug concentration drops below a threshold $v_{th}$. In summary, the
tumor cell kill is given by the function:
$$ L(x(t), v(t)) = k \, (v(t) - v_{th}) \, H(v(t) - v_{th}) \, x(t) $$
where k is the proportion of tumor cells killed per unit time per unit
drug concentration, and H is the Heaviside step function which is a
discontinuous function whose value is zero for negative argument
and one for nonnegative argument [22]:
$$
H(v(t) - v_{th}) =
\begin{cases}
0, & \text{if } v(t) < v_{th} \\
1, & \text{if } v(t) \ge v_{th}
\end{cases}
$$
The cell dynamics are described as

$$ \dot{x}(t) = \gamma x(t) \ln\left( \frac{K}{x(t)} \right) - L(x(t), v(t)) \, x(t) $$
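A forward-Euler simulation makes the interplay between infusion, drug decay, and threshold-limited cell kill concrete. This is a rough sketch only: it assumes that the kill term enters the tumor equation once, as k (v − v_th) H(v − v_th) x(t), and every parameter value and the function name `simulate_treated_tumor` are hypothetical.

```python
import math


def simulate_treated_tumor(x0, infusion, gamma, K, beta, k, v_th, dt, steps):
    """Forward-Euler integration of tumor size x and drug concentration v."""
    x, v = x0, 0.0
    for n in range(steps):
        heaviside = 1.0 if v >= v_th else 0.0
        kill = k * (v - v_th) * heaviside * x       # cell kill, zero below threshold
        dx = gamma * x * math.log(K / x) - kill     # Gompertz growth minus kill
        dv = infusion(n * dt) - beta * v            # infusion minus first-order decay
        x = max(x + dt * dx, 1e-9)                  # keep x positive for the logarithm
        v = v + dt * dv
    return x, v


# Hypothetical parameters, chosen only to make the two regimes visible.
params = dict(gamma=0.05, K=1e9, beta=0.3, k=0.1, v_th=1.0, dt=0.01, steps=5000)
untreated = simulate_treated_tumor(1e8, lambda t: 0.0, **params)
treated = simulate_treated_tumor(1e8, lambda t: 1.0, **params)
print(untreated, treated)
```

With a constant infusion rate u, the concentration settles at u/β; treatment is only effective when that plateau lies above the threshold v_th.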
In [23], the authors propose a model to describe the dynamics
of acute myeloblastic leukemia (AML). The two cell types
considered in this model are normal and leukemic hematopoietic
cells. The authors assume that the leukemic population inhibits the
growth of normal cells, and that drug treatment can kill both
leukemic and normal cells. The model is given by:
$$ \dot{L}(t) = g \log\left( \frac{L_A}{L} \right) L - f L - k v(t) L $$
$$ \dot{N}(t) = a \log\left( \frac{N_A}{N} \right) N - b N - c N L - h u(t) N + G(t) $$
where L(t) and N(t) denote the population of leukemic and
normal cells at time t, respectively. Parameters g and a represent the
birth rates of leukemic and normal cells, respectively, f and b are the
death rates of leukemic and normal cells, respectively. The parame-
ter c is the degree of inhibition exercised by the leukemic cells over
the normal cells, while LA and NA are the carrying capacities of
leukemic cells and normal cells, respectively. Parameters k and
h represent the drug’s effect on killing of both leukemic and normal
cells. Finally, G(t) is the regrowth rate of normal cells due to the
infusion and action of recombinant hematopoietic growth factors.
A four-compartment model is proposed to explain the kinetics
of the molecular response to imatinib [24]. There are three differ-
ent cell types in the model: normal cells, wild-type leukemic cells,
and mutant leukemic cells. For each cell type, the authors consid-
ered four layers: stem cells (SC), progenitor cells (PC), differen-
tiated cells (DC), and terminally differentiated cells (TC). Wild-
type leukemic cells can acquire mutations that confer resistance to
imatinib at rate μ. The authors assume that imatinib only decreases
the birth rates of mutant PC and DC. The basic model is given by
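$$
\begin{array}{llll}
\text{SC:} & \dot{x}_0 = [\lambda(x_0) - d_0]\, x_0 & \dot{y}_0 = [r_y (1 - \mu) - d_0]\, y_0 & \dot{z}_0 = (r_z - d_0)\, z_0 + r_y y_0 \mu \\
\text{PC:} & \dot{x}_1 = a_x x_0 - d_1 x_1 & \dot{y}_1 = a_y y_0 - d_1 y_1 & \dot{z}_1 = a_z z_0 - d_1 z_1 \\
\text{DC:} & \dot{x}_2 = b_x x_1 - d_2 x_2 & \dot{y}_2 = b_y y_1 - d_2 y_2 & \dot{z}_2 = b_z z_1 - d_2 z_2 \\
\text{TC:} & \dot{x}_3 = c_x x_2 - d_3 x_3 & \dot{y}_3 = c_y y_2 - d_3 y_3 & \dot{z}_3 = c_z z_2 - d_3 z_3
\end{array}
$$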
where x0, x1, x2, and x3 indicate the populations of normal SC, PC,
DC, and TC, respectively. y0, y1, y2, and y3 indicate the populations
of wild-type leukemic SC, PC, DC, and TC, respectively. z0, z1, z2,
and z3 indicate the populations of mutant leukemic SC, PC, DC,
and TC, respectively. The rate constants are given by a, b, and c with
appropriate indices between normal, wild type, and mutant leuke-
mic cells. d0, d1, d2, and d3 indicate the death rates of SC, PC, DC,
and TC, respectively. λ is a decreasing function describing the
homeostasis of normal SC. ry and rz are the birth rates of sensitive
leukemic and resistant leukemic SC, respectively. In our previous
work [14] an optimization problem was designed based on an

extension of this model that considered multiple possible drugs
(imatinib, dasatinib, and nilotinib). The goal of the optimization
problem was to identify a sequence of drug exposures that led to a
minimal leukemic cell burden at the end of a fixed time interval.
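The hierarchical structure of the four-compartment model is easiest to see as a right-hand-side function in which each layer is fed by the layer above it. The sketch below is illustrative only: the rate constants are hypothetical and not taken from [24], and the homeostasis function is replaced by a constant placeholder although the model calls for a decreasing function of the normal stem cell count.

```python
def four_compartment_rhs(x, y, z, p, homeostasis):
    """Time derivatives for normal (x), wild-type (y) and mutant (z) cells.

    Each of x, y, z lists the SC, PC, DC and TC abundances in that order.
    """
    d0, d1, d2, d3 = p["d"]
    dx = [
        (homeostasis(x[0]) - d0) * x[0],
        p["ax"] * x[0] - d1 * x[1],
        p["bx"] * x[1] - d2 * x[2],
        p["cx"] * x[2] - d3 * x[3],
    ]
    dy = [
        (p["ry"] * (1.0 - p["mu"]) - d0) * y[0],
        p["ay"] * y[0] - d1 * y[1],
        p["by"] * y[1] - d2 * y[2],
        p["cy"] * y[2] - d3 * y[3],
    ]
    dz = [
        (p["rz"] - d0) * z[0] + p["ry"] * y[0] * p["mu"],   # mutation influx from wild type
        p["az"] * z[0] - d1 * z[1],
        p["bz"] * z[1] - d2 * z[2],
        p["cz"] * z[2] - d3 * z[3],
    ]
    return dx, dy, dz


# Hypothetical rate constants; none are taken from [24].
p = dict(d=(0.1, 0.2, 0.5, 1.0), mu=0.01, ry=0.2, rz=0.3,
         ax=1.0, bx=2.0, cx=4.0, ay=1.0, by=2.0, cy=4.0, az=1.0, bz=2.0, cz=4.0)


def flat_homeostasis(x0):
    """Placeholder constant; the model calls for a decreasing function of x0."""
    return 0.1


dx, dy, dz = four_compartment_rhs(
    [10.0, 50.0, 200.0, 800.0], [100.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0],
    p, flat_homeostasis)
print(dx, dy, dz)
```

In this example the normal compartments sit at a steady state, the wild-type leukemic stem cells expand, and the only source of mutant stem cells is the mutation term.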
2.3 Toxicity Modeling
In cancer therapy, the goal is to achieve a maximal reduction in
tumor burden while keeping toxic side effects within acceptable
levels. Mathematical modeling can be used to understand the rela-
tionship between toxic side effects and treatment administration.
The control variables may be drug dosages or selections during the
course of therapy. Some existing models ignore the toxicity effects
by assuming that patients can tolerate the side effects during treat-
ment [25, 26]. However, it is often the case that patients are
required to go off drug for a period (drug holiday) due to severe
side effects such as grade 3–4 neutropenia [27]. Taking toxicity into
account brings an important phenomenon into the model and
allows for greater confidence when proposing treatment schedules
for the clinical setting.
One approach for modeling toxicity of cancer therapy is a
statistical approach that takes into account patient factors such as
immune system performance, loss of body weight, and side effects
experienced by patients [27–29]. For example, Sokal et al. [28]
developed a model to calculate the risk of drug toxicity during
treatment as a function of patient’s age and the number of platelet
and blast cells:
$$ r = \exp\bigl( 0.0116\,(\text{age} - 43.4) + 0.0345\,(\text{spln} - 7.51) + 0.188\,[(\text{pc}/700)^2 - 0.563] + 0.0887\,(\text{bc} - 2.10) \bigr) $$
where spln represents the spleen size, pc is the platelet count, and bc
is the number of blast cells. If r < 0.8, the patient is in a low-risk
protocol. If 0.8 ≤ r ≤ 1.2, the patient is in an intermediate-risk
protocol, and if r > 1.2, the patient is in a high-risk protocol.
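The risk score and its three-way classification can be written as two small functions. This sketch uses the commonly published form of the Sokal score, in which the spleen term carries the coefficient 0.0345; that coefficient and the assumed units (age in years, spleen size in cm below the costal margin, platelets in 10^9/L, blasts in percent) should be checked against [28] before any use. The function names are hypothetical.

```python
import math


def sokal_score(age, spleen_cm, platelets, blasts_pct):
    """Relative risk r from the commonly published Sokal formula (assumed units)."""
    exponent = (
        0.0116 * (age - 43.4)
        + 0.0345 * (spleen_cm - 7.51)
        + 0.188 * ((platelets / 700.0) ** 2 - 0.563)
        + 0.0887 * (blasts_pct - 2.10)
    )
    return math.exp(exponent)


def sokal_risk_group(r):
    """Thresholds as stated in the text: <0.8 low, 0.8-1.2 intermediate, >1.2 high."""
    if r < 0.8:
        return "low"
    if r <= 1.2:
        return "intermediate"
    return "high"


# A patient sitting exactly at the reference values has r = 1.
reference = sokal_score(43.4, 7.51, 700.0 * math.sqrt(0.563), 2.10)
print(reference, sokal_risk_group(reference))
```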
In [20], the authors proposed a mathematical model for the
prevention of excessive side effects in cancer chemotherapy. First,
the drug concentration at the cancer site at any time should not
exceed a positive constant value $v_{\max}$:
$$ 0 \le v(t) \le v_{\max} $$
Second, the total cumulative toxicity, obtained by taking the integral
of drug concentration over the course of treatment, should not
exceed a positive constant value $v_{cum}$:
$$ \int_0^T v(s)\, ds \le v_{cum} $$
Third, it is also possible to limit the cumulative toxicity over a
window of time which is shorter than the total treatment time
T under a threshold $v_{di}$:
$$ \int_0^{t+dt} v(s)\, ds \le v_{di} $$
The above toxicity constraints are widely applied in cancer
optimization problems.
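The three constraints can be checked on a sampled concentration profile. The sketch below uses a rectangle rule for the integrals and interprets the third constraint as a sliding window of fixed length, which is an assumption about the intended meaning rather than something stated in [20]; the function name, the profile, and the thresholds are all hypothetical.

```python
def check_toxicity(v, dt, v_max, v_cum, window_steps, v_window):
    """Return (peak_ok, cumulative_ok, window_ok) for concentrations v sampled every dt."""
    peak_ok = all(0.0 <= c <= v_max for c in v)
    cumulative_ok = sum(c * dt for c in v) <= v_cum
    # Assumed interpretation: exposure over every window of `window_steps` samples.
    window_ok = all(
        sum(v[i:i + window_steps]) * dt <= v_window
        for i in range(len(v) - window_steps + 1)
    )
    return peak_ok, cumulative_ok, window_ok


# Hypothetical profile with a two-step burst in the middle of treatment.
profile = [1.0, 1.0, 4.0, 4.0, 1.0, 1.0]
result = check_toxicity(profile, dt=1.0, v_max=5.0, v_cum=12.0,
                        window_steps=2, v_window=7.0)
print(result)
```

The burst respects the peak and total-exposure limits but violates the windowed limit, which is the situation the third constraint is meant to catch.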
In our previous work [14], we proposed a discrete optimization
model for studying optimal treatment regimens for CML. We
defined cytotoxic regimens as schedules resulting in low absolute
neutrophil count (ANC) values in patients at any time during the
therapy. We built a simple mathematical model for the evolution
of ANC levels under a variety of therapies and then used this
model to monitor the dynamics of the patient’s ANC in response to
each therapy protocol to ensure that the resulting toxicity in the
patient falls within acceptable ranges.
3 Robust Optimization for Patients with CML
In this section, we introduce our original work that develops a

dynamical model to study the optimal treatment protocol under
toxicity uncertainty in the context of a specific cancer type, CML.
CML is a cancer of the blood and bone marrow that is normally
caused by the oncogene BCR-ABL [30]. The treatment of CML
was transformed by the development of imatinib which is a selective
inhibitor of the chimeric protein Bcr-Abl (product of the oncogene
BCR-ABL). Initial clinical trials showed that the use of imatinib for
the treatment of CML resulted in rapid response in the majority of
patients [9, 31]. Despite this positive effect, around 20% of patients
treated with imatinib do not achieve a complete cytogenetic
response (CCR) [44]. One possible cause of this is the
presence of imatinib resistant CML cells. In another study of
BCR-ABL mutations in CML patients, researchers report that
mutations were detected in 195/467 (41%) patients [32]. Several
new inhibitors, such as nilotinib and dasatinib, have been developed
to obtain an increased potency and a broader range of activity
against the known imatinib-resistant mutants [33]. Nilotinib has a
20–30-fold increase in potency over imatinib, while dasatinib
shows 100–300-fold higher potency than imatinib in vitro
[34]. Overall, these three drugs are promising in the treatment of
CML. An important issue that also needs to be considered is that
side effects arise due to drug toxicity, including low blood cell
count, fever, heart problems, as well as a number of other adverse
events [35–37]. Different patients may suffer different side effects
and even for the same patient, due to the change in health condi-
tion over time, side effects may vary over the course of treatment.
This complexity of the side effects induced by TKI therapy makes
the scheduling of treatment for CML challenging.
3.1 Nominal Problem Formulation
In this section, we first introduce a system of ordinary differential
equations (ODEs) that describes the dynamics of normal stem cells,
wild-type CML cells, and mutant CML cells in response to combination
therapy. Then we explain how the toxicity associated with
treatment protocols is quantified by monitoring ANC values during
treatment. Next, we propose a deterministic optimization problem
to find the best schedule of multiple therapies based on the evolution
of CML cells according to our ODE model.
The resulting optimization problem is nontrivial due to the presence
of ordinary differential equation constraints and integer variables. We
explain how the nominal problem can be solved efficiently.
3.1.1 CML Dynamics
We use ODEs to describe the dynamics of stem cells for CML
patients over a given time period of M weeks. There are three
different types of stem cells: normal stem cells (NSC), wild-type
stem cells (WSC), and mutant stem cells (MSC). Let
$I = \{1, 2, 3, \ldots, n\}$ be the set of stem cell types, where types 1, 2,
and i denote NSC, WSC, and type $(i - 2)$ MSC ($3 \le i \le n$), respectively. Let
$J = \{0, 1, 2, 3\}$ be the set of drugs used to treat CML, where drugs
0, 1, 2, and 3 denote a drug holiday, nilotinib, dasatinib, and
imatinib, respectively. Let $\mathcal{M} = \{1, 2, 3, \ldots, M\}$ be the set of
treatment periods and $x_i(t)$ the abundance of cell type $i \in I$
(NSC, WSC, or MSC) at time t. In this project, we assume
that $\Delta t = 7$ days. If drug $j \in J$ is taken in week m, the cell
dynamics are modeled as follows:

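$$
\begin{aligned}
\dot{x}_1(t) &= \left( b_1^j \psi_{x_1} - d \right) x_1, && t \in [m\Delta t, (m+1)\Delta t],\ m \in \mathcal{M} \setminus \{M\}, && \text{(4a)}\\
\dot{x}_2(t) &= \left( b_2^j (1 - (n-2)\mu)\, \psi_{x_2} - d \right) x_2, && t \in [m\Delta t, (m+1)\Delta t],\ m \in \mathcal{M} \setminus \{M\}, && \text{(4b)}\\
\dot{x}_i(t) &= \left( b_i^j \psi_{x_2} - d \right) x_i + \mu b_2^j \psi_{x_2} x_2, && t \in [m\Delta t, (m+1)\Delta t],\ m \in \mathcal{M} \setminus \{M\},\ 3 \le i \le n. && \text{(4c)}
\end{aligned}
$$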
Here, we assume that the birth rates of the NSC, WSC, and
MSC are drug specific, but drugs do not affect the death rates of
stem cells and all the stem cells have the same death rate d. The
division rates of NSC, WSC, and MSC under drug j are $b_1^j$, $b_2^j$, and
$b_i^j$ per week, respectively. MSC arise from WSC by mutation at
rate μ. The competition between normal and leukemic
stem cells is modeled by the density dependence function $\psi_{x_i}$, where
$$ \psi_{x_i} = \frac{1}{1 + p_i \sum_{k=1}^{n} x_k(t)}. $$
These functions ensure that the total
number of normal and leukemic stem cells remains constant once
the system reaches a steady state [38]. We set the constants
$p_1 = (b_1^0/d - 1)/K_1$ and $p_2 = (b_2^0/d - 1)/K_2$, where $K_1$ and $K_2$ are the
equilibrium abundances of NSC and WSC. In the equilibrium system
of NSC, we further assume only NSC is present. In the equilibrium
system of WSC, we assume only WSC is present. We also
assume that $p_2 = p_i$ ($3 \le i \le n$).
Note that in this model we focus solely on the stem cell layer
since our earlier work [14] thoroughly investigated the structure of
optimal schedules in a hierarchical population model, i.e., model
with multiple layers.
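The stem-cell dynamics of Eqs. 4a–4c can be integrated week by week, switching the birth rates whenever the drug changes. The following forward-Euler sketch measures time in weeks and assumes that p_1 and p_2 are set from the holiday birth rates as described above; all numerical rates, the two-drug dictionary, and the function names are hypothetical and chosen only for illustration.

```python
def stem_cell_rhs(x, births, d, mu, p1, p2):
    """Right-hand side of Eqs. 4a-4c; x = [NSC, WSC, mutant 1, ..., mutant n-2]."""
    n = len(x)
    total = sum(x)
    psi1 = 1.0 / (1.0 + p1 * total)      # density dependence for NSC
    psi2 = 1.0 / (1.0 + p2 * total)      # shared by WSC and all mutants (p_i = p_2)
    dx = [(births[0] * psi1 - d) * x[0],
          (births[1] * (1.0 - (n - 2) * mu) * psi2 - d) * x[1]]
    for i in range(2, n):
        dx.append((births[i] * psi2 - d) * x[i] + mu * births[1] * psi2 * x[1])
    return dx


def simulate(x0, schedule, births_by_drug, d, mu, p1, p2, steps_per_week=100):
    """Forward-Euler integration with one drug per week; returns weekly snapshots."""
    x, dt = list(x0), 1.0 / steps_per_week
    weekly = [list(x)]
    for drug in schedule:
        for _ in range(steps_per_week):
            dx = stem_cell_rhs(x, births_by_drug[drug], d, mu, p1, p2)
            x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
        weekly.append(list(x))
    return weekly


# Hypothetical weekly rates for NSC, WSC and two mutant clones.
d, mu, K1, K2 = 0.003, 1e-7, 1e7, 2e7
births_by_drug = {0: [0.008, 0.011, 0.011, 0.011],    # drug holiday
                  1: [0.008, 0.001, 0.009, 0.001]}    # a single hypothetical TKI
p1 = (births_by_drug[0][0] / d - 1.0) / K1
p2 = (births_by_drug[0][1] / d - 1.0) / K2

healthy = simulate([K1, 0.0, 0.0, 0.0], [0] * 10, births_by_drug, d, mu, p1, p2)
on_drug = simulate([9e6, 9e5, 1e5, 1e5], [1] * 52, births_by_drug, d, mu, p1, p2)
print(healthy[-1], on_drug[-1])
```

With only NSC present, the choice of p_1 makes K_1 a steady state, which gives a quick sanity check on the implementation; under the hypothetical drug, the sensitive WSC shrink while the resistant clone expands.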
3.1.2 Toxicity Modeling in Nominal Optimization Problem
In our previous work [14], we developed a model to quantify the
ANC levels in patients during the course of therapy. Here we review
this model. We assume the patient’s ANC level decreases at rate
$d_{anc,j}$ per week while taking drug j, for j = 1, 2, 3. During a drug
holiday, the ANC increases at rate $d_{anc,0}$ per week but never exceeds the
normal level $ANC_{normal}$. At the same time, the ANC should stay above
an acceptable threshold level $L_{anc}$. The ANC levels are modeled as:
$$ y^{m+1} = \min\left( y^m - \sum_{j \in J} d_{anc,j}\, z^{m,j},\ ANC_{normal} \right) $$
$$ y^m \ge L_{anc} $$
where $y^m$ is the ANC value at week m. The binary decision
variables $z^{m,j}$ indicate whether drug j is taken in week m or not, for each
j = 0, 1, 2, 3 and m = 0, 1, ..., M − 1.
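The ANC recursion is simple enough to trace by hand, and a short script helps to see how long a given drug can be taken before a holiday is required. The sketch below uses the weekly rates and thresholds quoted in Subheading 4.1; it writes the holiday recovery explicitly as a positive weekly change, which is an assumption about the sign convention, and the names `anc_trajectory` and `is_tolerable` are hypothetical.

```python
ANC_NORMAL = 3000.0     # normal ANC level per mm^3
L_ANC = 1000.0          # lowest acceptable ANC per mm^3
# Weekly ANC change per drug: 0 = holiday, 1 = nilotinib, 2 = dasatinib, 3 = imatinib.
WEEKLY_CHANGE = {0: +500.0, 1: -145.8333, 2: -125.0, 3: -56.4516}


def anc_trajectory(schedule, y0=ANC_NORMAL):
    """ANC at the start of each week for a list of weekly drug choices."""
    y = [y0]
    for drug in schedule:
        y.append(min(y[-1] + WEEKLY_CHANGE[drug], ANC_NORMAL))
    return y


def is_tolerable(schedule, y0=ANC_NORMAL):
    """True if the ANC never falls below the threshold."""
    return all(level >= L_ANC for level in anc_trajectory(schedule, y0))


print(is_tolerable([2] * 16), is_tolerable([2] * 17))
print(anc_trajectory([2] * 16 + [0])[-1])
```

Under these values a patient starting at the normal level can take dasatinib for 16 consecutive weeks before reaching the threshold, and one holiday week restores 500/mm^3.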
3.1.3 Nominal Optimization Problem
Assume that the initial population for each cell type is known. The
goal of the nominal problem is to develop a treatment protocol that
minimizes the cumulative leukemic cell number over a given
planning period subject to the toxicity constraints. The drug used
in each treatment cycle is determined by the weekly treatment
decision. Within each week, the dosing regimen stays identical on
a day-to-day basis. The cumulative leukemic cell number is
modeled by the total number of WSC and MSC, which is
$\sum_{m \in \mathcal{M},\, i \in I \setminus \{1\}} x_{i,m}$, where $x_{i,m} = x_i(m\Delta t)$.
The nominal optimization problem can be formulated as a
mixed-integer optimization problem with ODE constraints; details
are provided in Appendix 1.
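On a very small instance, the structure of the nominal problem (choose one drug per week, minimize cumulative leukemic burden, keep the ANC above its threshold) can be illustrated by exhaustive enumeration. This is not the solution method of Appendix 1: the sketch replaces the ODE model with hypothetical constant weekly net growth rates per drug and clone, and it is only practical for a handful of weeks because the number of schedules grows as 4^M.

```python
import itertools
import math

# Hypothetical weekly net growth rates per drug for (WSC, mutant A, mutant B).
NET_GROWTH = {
    0: (0.05, 0.05, 0.05),      # drug holiday
    1: (-0.10, 0.02, -0.08),
    2: (-0.12, -0.06, 0.10),
    3: (-0.08, 0.04, -0.05),
}
ANC_CHANGE = {0: 500.0, 1: -145.8333, 2: -125.0, 3: -56.4516}
ANC_NORMAL, L_ANC = 3000.0, 1000.0


def evaluate(schedule, cells0, anc0):
    """Cumulative leukemic burden of a schedule, or None if the ANC limit is violated."""
    cells, anc, burden = list(cells0), anc0, 0.0
    for drug in schedule:
        anc = min(anc + ANC_CHANGE[drug], ANC_NORMAL)
        if anc < L_ANC:
            return None
        cells = [c * math.exp(r) for c, r in zip(cells, NET_GROWTH[drug])]
        burden += sum(cells)
    return burden


def best_schedule(weeks, cells0, anc0):
    """Enumerate all 4**weeks schedules and keep the feasible one with least burden."""
    best = None
    for schedule in itertools.product(range(4), repeat=weeks):
        burden = evaluate(schedule, cells0, anc0)
        if burden is not None and (best is None or burden < best[0]):
            best = (burden, schedule)
    return best


cells0, anc0 = (9e5, 1e5, 1e5), 1200.0      # low starting ANC so toxicity binds
burden, schedule = best_schedule(5, cells0, anc0)
print(schedule, burden)
```

Starting from a low ANC forces the enumeration to trade treatment weeks against recovery weeks, which is the same tension the mixed-integer formulation resolves at scale.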
3.2 Robust Problem Formulation
A challenge of utilizing this optimization procedure in the clinical
setting is that parameters such as birth rates, death rates, and
toxicity decreasing rates in model (Eq. 7) may vary among patients.
Even for the same patient, due to changes in health status, these
parameters may vary over time. By modeling the uncertainty in
(Eq. 7), we investigate how parametric uncertainty affects the
optimal solution. Specifically, we consider the uncertainty of drug
toxicity in the model. The problem is formulated as a mixed-integer
robust optimization problem. Our objective is to minimize
the cumulative leukemic cell number over a fixed period.
3.2.1 Toxicity Uncertainty
We primarily focus on uncertainty in the rate at which the ANC
level decreases under the different treatment options. In particular,
we assume
$$ d_{anc}^{m,j} = L^j + \hat{C}^j \eta^{m,j} \qquad \text{(5)} $$
where $L^j$ is the lower bound of the ANC decrease rate under drug j, $\hat{C}^j$
is a positive constant, and $\eta^{m,j}$ is an unknown random variable
between 0 and 1, which is used to capture the uncertainty of drug
toxicity. First note that we can relax constraint (Eq. 8g) with the
following inequality:
$$ y^{m+1} \le y^m - \sum_{j \in J} \left( L^j + \hat{C}^j \eta^{m,j} \right) z^{m,j} = y^m - \sum_{j \in J} L^j z^{m,j} - \sum_{j \in J} \hat{C}^j \eta^{m,j} z^{m,j}. $$
In robust optimization, we are interested in finding the best

solution that is feasible for all realizations of uncertain parameters
and we do not allow any violation of the toxicity constraint for any
parameters taking values in the sets (Eq. 5). Therefore, the robust
counterpart of the nominal problem associated with uncertainty
sets defined in Eq. 5 is found by solving
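$$
\begin{aligned}
\min\ & \sum_{m \in \mathcal{M},\, i \in I \setminus \{1\}} x_{i,m} && \text{(6a)}\\
\text{s.t.}\ & \sup \left\{ y^{m+1} - y^m + \sum_{j \in J} L^j z^{m,j} + \sum_{j \in J} \hat{C}^j \eta^{m,j} z^{m,j} \;\middle|\; \eta^{m,j} \in [0, 1] \right\} \le 0 && \text{(6b)}\\
& \text{(8b), (8c), (8d), (8e), (8f), (8h), (8i), (8j), (8k), (8l), (8m), (8n)} && \text{(6c)}
\end{aligned}
$$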
Further mathematical details on the solution and derivation of

the robust optimization problem can be found in Appendix 2.
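Because the uncertain term in Eq. 6b is a sum of nonnegative multiples of the η variables, the supremum over η in [0, 1] is attained when every η equals 1, so the worst-case weekly ANC drop under drug j is L^j plus its uncertainty width. The sketch below checks a schedule against this worst case. It takes the nominal rates from Subheading 4.1 as L^j, sets the width to a fixed fraction of L^j, and assumes that holiday recovery is not uncertain; the function names are hypothetical, and protection levels are not modeled here.

```python
NOMINAL_DROP = {1: 145.8333, 2: 125.0, 3: 56.4516}   # L^j per week (nilotinib, dasatinib, imatinib)
HOLIDAY_GAIN = 500.0                                  # assumed certain (no uncertainty for j = 0)
ANC_NORMAL, L_ANC = 3000.0, 1000.0


def worst_case_anc(schedule, c_ratio, y0=ANC_NORMAL):
    """ANC path when every eta takes its worst value 1, with width c_ratio * L^j."""
    y = [y0]
    for drug in schedule:
        if drug == 0:
            change = HOLIDAY_GAIN
        else:
            change = -NOMINAL_DROP[drug] * (1.0 + c_ratio)
        y.append(min(y[-1] + change, ANC_NORMAL))
    return y


def robustly_feasible(schedule, c_ratio, y0=ANC_NORMAL):
    """True if the ANC stays at or above the threshold for every realization of eta."""
    return all(level >= L_ANC for level in worst_case_anc(schedule, c_ratio, y0))


print(robustly_feasible([2] * 16, 0.0))   # nominal rates
print(robustly_feasible([2] * 16, 0.2))   # 20% uncertainty width
print(robustly_feasible([2] * 13, 0.2))
```

A schedule that is feasible at the nominal rates may therefore need earlier or more frequent holidays once toxicity uncertainty is taken into account.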
4 Example of Applying Modeling Methodology to CML Treatment
In this section, first we will describe the dataset and parameters that
were used in our numerical experiments, then the dynamics of the
CML cells under three mono-therapies will be simulated. Next, the
solution to the nominal and robust optimum drug schedule under

toxicity constraints will be explored. At the end of this section the
sensitivity of the optimal solution to model parameters and the
effect of the robust optimization on treatment outcome are
studied.
4.1 Parameter Selection
For our model, we assume there are two BCR-ABL mutant cell types,
Y253F and F317L. We consider patients harboring three
different levels of the BCR-ABL mutant cells before the start of
therapy: low, medium, and high. The corresponding initial cell
populations are given in Table 1. The parameter settings for birth
rates and death rates in our model (Eq. 7) are given below. Based on
[38], we set the death rate d to be 0.003. The net-growth rate
of NSC is assumed to be 0.005. The net-growth rates of WSC, $b_2^j$,
under drug holiday and mono-therapies are 0.008 and 0.002, respectively. We
assume that the net-growth rates of MSC under holiday,
$b_i^0$, i = 3, 4, are the same as $b_2^0$, which is 0.008. The values of $b_i^j$
for i = 3, 4 and $j \ge 1$ are estimated based on the work presented in [39], which
studied the in vivo mutational selectivity profile for mono-therapies.
For Y253F, the estimated values of $b_3^j$ are 0.0088, 0.0097,
and 0.0101 under nilotinib, dasatinib, and imatinib, respectively. For
F317L, the estimated values of $b_4^j$ are 0.0228, 0.0509, and
0.0079 under nilotinib, dasatinib, and imatinib, respectively. The
mutation rate of WSC is $10^{-7}$ [24]. We assume the equilibrium
abundances of NSC ($K_1$) and WSC ($K_2$) are $10^7$ and $2 \times 10^7$,
respectively.
For toxicity constraints, we assume the patient’s normal ANC
level is $U_{anc} = 3000/\text{mm}^3$ and the ANC cannot fall below
$L_{anc} = 1000/\text{mm}^3$. We assume that the patient’s initial ANC is
$3000/\text{mm}^3$. Based on the median time to a grade 3 or 4 episode of
neutropenia, we estimated the weekly decrease rates of ANC as
$d_{anc,1} = 145.8333/\text{mm}^3$ under nilotinib [40],
$d_{anc,2} = 125/\text{mm}^3$ under dasatinib [41], and
$d_{anc,3} = 56.4516/\text{mm}^3$ under imatinib [10]. We assume that the ANC of a patient
increases by $d_{anc,0} = 500/\text{mm}^3$ per week during a drug holiday, until it
reaches the normal level of $3000/\text{mm}^3$. In this project, we consider
two levels of uncertainty: $\hat{C}^j = 0.2\, L^j$ and $\hat{C}^j = 0.3\, L^j$.
Table 1
Initial cell population conditions

Mutant level   NSC           WSC           Y253F         F317L
Low            9.00 × 10^6   9.00 × 10^5   1.00 × 10^4   1.00 × 10^4
Medium         9.00 × 10^6   9.00 × 10^5   1.00 × 10^5   1.00 × 10^5
High           9.00 × 10^6   9.00 × 10^5   3.00 × 10^6   3.00 × 10^6
4.2 Cell Dynamics Simulations
In this part, we present the dynamics of stem cells with the preexisting
BCR-ABL mutations Y253F and F317L under mono-therapies.
As reported, Y253F is highly resistant to imatinib, slightly
resistant to nilotinib, and sensitive to dasatinib; F317L is highly
resistant to dasatinib, and sensitive to imatinib and nilotinib. For
this simulation pattern, we expect that all monotherapies will fail
eventually because of the presence of the mutant cells and their
differential responses to drugs. The initial levels of NSC, WSC,
Y253F, and F317L are 9E+06, 9E+05, 1E+05, and 1E+05,
respectively.
Figure 1 plots the cell dynamics over 420 weeks (around
8 years) for six treatment protocols: nilotinib, dasatinib, and imatinib
mono-therapy, each performed with and without
drug holiday. As F317L is resistant to dasatinib, the population of
F317L explodes around week 50 when administering dasatinib
[42]. On the other hand, we note that the population of Y253F
is well controlled. In Fig. 2, we only look at the performance of
imatinib and nilotinib mono-therapy. The population of Y253F
increases over time, but the population size of F317L decreases
in both cases. These results indicate that drug combination may
be more effective for treating patients with multiple mutant cell
types.
Next, we discuss the results for nominal and robust optimization
problems. We first report the recurrence time of the optimal
schedule and mono-therapies assuming that the toxicity parameters
are known, i.e., the nominal problem. In addition, we investigate
the recurrence time of the resulting optimal schedule when
perturbing model parameters. Finally, we solve the robust optimization
problem under different uncertainty settings and initial
conditions.
Fig. 1 (a–e) Cell dynamics under mono-therapies for 420 weeks (mutant cell types: Y253F and F317L)
Fig. 2 (a–e) Cell dynamics under mono-therapies (without dasatinib) for 420 weeks (mutant cell types: Y253F
and F317L)
4.3 Nominal Optimal Treatment Plans
In this section, we are interested in the recurrence time for two
scenarios: mono-therapy and the nominal optimized therapies that
are obtained by solving the model presented in Appendix 1 for
360 weeks. The recurrence time is defined as the time at which
the tumor cell population returns to its size at the start of treatment.
The initial conditions for NSC, WSC, Y253F, and F317L are
9E+06, 5E+05, 3E+05, and 3E+05, respectively. The
nominal optimal treatment plans are given in Fig. 3. The cell
growth is shown in Fig. 4. Since F317L is highly resistant to
dasatinib, we show the dynamics of tumor growth under dasatinib
only for 50 weeks. The results are summarized in Table 2. Under
the optimal schedule, the tumor size keeps decreasing, and thus
there is no recurrence time. We thus denote recurrence time by
NA. Under nilotinib, the tumor size reaches its minimal size at
week 88, then reaches the initial population size at week 183, and
doubles its size at week 261. Under imatinib, the tumor size
reaches the minimal size at week 63, reaches the initial population
size at week 130, and doubles its size at week 185. Under dasatinib,
the tumor size keeps increasing.
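The three summary statistics reported for each therapy (time to minimal size, recurrence time, and doubling time) can be read off a weekly trajectory of leukemic burden. The sketch below assumes one reasonable reading of the definitions: recurrence is the first week after the minimum at which the burden is back at or above its initial value, and `None` plays the role of NA. The function name and the example trajectories are hypothetical.

```python
def recurrence_summary(tumor):
    """Return (week of minimum, recurrence week, doubling week) for weekly burdens.

    tumor[m] is the leukemic burden at week m; None means the event never occurs.
    """
    initial = tumor[0]
    week_min = min(range(len(tumor)), key=tumor.__getitem__)
    recurrence = next(
        (m for m in range(week_min + 1, len(tumor)) if tumor[m] >= initial), None)
    doubling = next(
        (m for m in range(1, len(tumor)) if tumor[m] >= 2 * initial), None)
    return week_min, recurrence, doubling


relapsing = [100, 80, 60, 70, 90, 110, 150, 210]
controlled = [100, 90, 80]
print(recurrence_summary(relapsing))
print(recurrence_summary(controlled))
```

A trajectory that keeps decreasing has no recurrence or doubling week, which corresponds to the NA entries for the optimized schedule.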
We also performed a sensitivity analysis on the nominal optimal
solution (shown in Fig. 3) with respect to the birth rates of mutant
cells ($b_i^j$ for i = 3, 4 and j = 1, 2, 3). We are interested in how
the recurrence time under the schedule shown in Fig. 3 changes as we
vary the birth rates of mutant cells. A 360-week simulation is run to
study the behavior of the recurrence time. We consider two scenarios.
Scenario one is that the birth rates of both mutant cell types vary
under only one drug, while the birth rates of mutant cells stay
constant under the other two drugs, e.g., if the drug affecting
birth rates is nilotinib, then $b_3^1$ and $b_4^1$ are scaled by ratios uniformly
distributed on [0.7, 1.3], while $b_3^2$, $b_4^2$, $b_3^3$, and $b_4^3$ are fixed. The
other scenario is that the birth rates of one mutant cell type change
under all drugs, whereas the birth rates of the second mutant cell
type stay constant, e.g., the birth rates of Y253F change under all
three drugs, while the birth rates of F317L stay the same under all
three drugs.
Fig. 3 Optimal solution of the nominal problem for 360 weeks (mutant cell types: Y253F and F317L). Digits
0, 1, 2, and 3 represent drug holiday, nilotinib, dasatinib, and imatinib
Fig. 4 Cell dynamics under mono-therapies and optimal solution of the nominal problem for 360 weeks
(mutant cell types: Y253F and F317L)
Table 2
Recurrence time for multiple mutants (weeks)

Therapy            To minimal   Recurrence time   Double the size
Nilotinib          88           183               261
Dasatinib          1            1                 20
Imatinib           63           130               185
Drug combination   NA           NA                NA

Fig. 5 The recurrence time with respect to the birth rate changes of Y253F and F317L under one drug when
the treatment protocols are fixed as the nominal optimal solution. (a–c) show the recurrence time when birth
rates of Y253F and F317L vary under nilotinib, dasatinib, and imatinib, respectively
Figure 5 shows the results for scenario one. The colors indicate
different recurrence times as indicated by the colorbar, i.e., blue,
green, and red correspond to recurrence times of 0, 150, and
360, respectively. The original birth rate of the type $(i - 2)$ mutant cell
under drug j, $b_{i,o}^{j}$, is given in Subheading 4.1. The varied birth
rate of the type $(i - 2)$ cell under drug j is represented by $b_i^j$. The
ratio $b_i^j / b_{i,o}^j$ is set to be uniformly distributed on [0.7, 1.3]. The
results in Fig. 5a, c indicate that the tumor size is below the initial
tumor size after 360 weeks using the proposed method. The results
in Fig. 5b show that if $b_4^2 / b_{4,o}^2$ is >1.27, recurrence happens before
the end of treatment. Recall that $b_{4,o}^2$, a positive value, is the
original net growth rate of F317L under dasatinib. As we increase
$b_4^2 / b_{4,o}^2$, $b_4^2$ increases, which causes F317L to grow faster under
dasatinib. However, overall we see that the optimal schedule
(shown in Fig. 3) is largely robust to changes in the birth rates of
the mutant cells.
Figure 6 shows the results for scenario two where birth rates of
F317L vary. For better visualization, we fix the birth rates
under one drug while varying the birth rates under the other two,
and show the resulting recurrence time. Figure 6a indicates that
the ratio of the F317L birth rate under nilotinib (drug 1) is fixed at
$b_4^1 / b_{4,o}^1 = 0.7$ and the ratios $b_4^2 / b_{4,o}^2$ and $b_4^3 / b_{4,o}^3$ are
uniformly distributed on [0.7, 1.3]. Columns 1, 2, and 3 show
the recurrence time of the tumor for a fixed ratio $b_4^j / b_{4,o}^j$ set at 0.7,
1.0, and 1.3, respectively. Figures in rows 1 (a, b, c), 2 (d, e, f), and
3 (g, h, i) correspond to j = 1, 2, 3, respectively. The figures in
the first row show that an increase in $b_4^1 / b_{4,o}^1$ makes the tumor
less likely to reach its initial size by the end of treatment. The
reason is that F317L is highly sensitive to drug 1, which is nilotinib.
As we increase the ratio $b_4^1 / b_{4,o}^1$, the growth rate of F317L is
reduced. From Fig. 6a, we can see that under the extreme case
($b_4^1 / b_{4,o}^1 = 0.7$, $b_4^2 / b_{4,o}^2 = 1.3$, and $b_4^3 / b_{4,o}^3 = 0.7$), recurrences
happen around week 150. The result is consistent with the recurrence
time reported in Fig. 6f, g. There is no recurrence when
$b_4^2 / b_{4,o}^2 \le 1$, but if $b_4^2 / b_{4,o}^2 = 1.3$, recurrence happens in almost
half of the cases. Since F317L is sensitive to both nilotinib and
imatinib, the results in Fig. 6g–i are similar to the ones in Fig. 6a–c.
The difference is that there is still a chance for tumor recurrence
when $b_4^3 / b_{4,o}^3 = 1.3$, because nilotinib is applied more often
compared to imatinib in the nominal optimal solution.
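The kind of sensitivity sweep described above (hold the schedule fixed, scale one rate by a ratio in [0.7, 1.3], and record the recurrence time) can be sketched on a toy model. The example below does not use the chapter's ODE model or its fitted rates: it assumes constant weekly net growth rates per drug and clone, a fixed alternating schedule, and a regular grid of ratios instead of uniform sampling, all of which are hypothetical.

```python
import math

# Hypothetical weekly net growth rates per drug for (WSC, clone A, clone B).
BASE_GROWTH = {
    1: [-0.10, 0.02, -0.08],
    2: [-0.12, -0.06, 0.10],
}
SCHEDULE = [1, 2] * 60            # fixed 120-week alternating schedule (illustrative)
CELLS0 = [9e5, 1e5, 1e5]


def recurrence_week(schedule, cells0, growth):
    """First week at which the leukemic burden is back at or above its initial value."""
    initial, cells = sum(cells0), list(cells0)
    for week, drug in enumerate(schedule, start=1):
        cells = [c * math.exp(r) for c, r in zip(cells, growth[drug])]
        if sum(cells) >= initial:
            return week
    return None                    # no recurrence within the horizon


def sweep(drug, clone, ratios):
    """Recurrence week for each ratio applied to one clone's rate under one drug."""
    results = {}
    for ratio in ratios:
        growth = {j: list(rates) for j, rates in BASE_GROWTH.items()}
        growth[drug][clone] = BASE_GROWTH[drug][clone] * ratio
        results[ratio] = recurrence_week(SCHEDULE, CELLS0, growth)
    return results


ratios = [0.7, 0.85, 1.0, 1.15, 1.3]
by_ratio = sweep(drug=2, clone=2, ratios=ratios)
print(by_ratio)
```

Scaling up a positive growth rate can only bring recurrence forward, so the recurrence week is non-increasing in the ratio; this mirrors the pattern reported for F317L under dasatinib.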
4.4 Robust Optimal Treatment Plans
As we discuss in Appendix 2, protection levels (Γm) trade off the
robustness of the proposed model against the level of conservatism
of the solution. In this part, we first compare the robust optimal
solutions under different protection levels, which are provided in
Appendix 3, with two monotherapies (imatinib and nilotinib). Figure 7
shows the dynamics of tumor growth for 30 weeks under
nilotinib, imatinib, and optimal solutions with different protection
levels (Fig. 8). For this simulation, we assume $\hat{C}^j = 0.2\, L^j$, and
initial population sizes are 9E+06, 9E+05, 1E+05, and
1E+05, for NSC, WSC, Y253F, and F317L, respectively. It is
interesting to note that the tumor sizes under the proposed methods
at week 30 are lower than those predicted for either of the
mono-therapies even though our objective function aims to minimize
the cumulative tumor sizes.
Fig. 6 (a–i) The recurrence time with respect to the varied birth rates of F317L under three drugs when the
treatment protocols are fixed as the nominal optimal solution. Rows: constant birth rate set under nilotinib,
dasatinib, and imatinib, respectively. Columns: constant birth rate ratio set at 0.7, 1.0 and 1.3, respectively
Table 3 shows the robust optimal solutions under
$\hat{C}_j = 0.2 L_j$ for patients with initial tumor sizes at low,
medium, and high levels. For example, if we take
$\hat{C}_j = 0.2 L_j$, for patients with an initial tumor size at low
level, under zero protection level, Γ = 0, the optimal value is
$2.41632 \times 10^7$. However, with full protection, Γ = 3, the
optimal value is increased by 0.541% to $2.42939 \times 10^7$. For
patients with an initial tumor size at medium level, under zero
protection level, the optimal value is $2.9189 \times 10^7$. With
full protection, the optimal value is increased by 0.7044% to
$2.9395 \times 10^7$. For patients with an initial tumor size at high
level, under zero protection level, the optimal value is
$2.7933 \times 10^7$. With full protection, the optimal value is
increased by 1.3519% to $2.8311 \times 10^7$.
Figure 9 shows the increments of optimal values under different
protection levels for patients with initial tumor sizes at low,
medium, and high levels when assuming $\hat{C}_j = 0.2 L_j$. The
increments are calculated by $(Y^*_\Gamma - Y^*_0) / Y^*_0$, where
$Y^*_0$ and $Y^*_\Gamma$ are the optimal values of the nominal and
robust optimization problems under different protection levels,
respectively. It is interesting to note that the optimal value of the
objective function increases as we increase the protection level of
robust solutions.
Fig. 7 Cell dynamics under mono-therapies (without dasatinib) and optimal solutions for 30 weeks (mutant cell
types: Y253F and F317L)
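As a quick check of the increment formula, the following minimal Python sketch recomputes the relative increase for the low-initial-tumor-size case, using the two optimal values quoted in the text. The function name `increment_percent` is ours and is not part of the chapter.

```python
def increment_percent(nominal_value: float, robust_value: float) -> float:
    """Relative increase (in %) of the robust optimum over the nominal optimum."""
    return (robust_value - nominal_value) / nominal_value * 100.0

# Optimal values for the low initial tumor size, in units of 10^7 cells.
y_nominal = 2.41632   # zero protection
y_robust = 2.42939    # full protection

print(f"{increment_percent(y_nominal, y_robust):.3f}%")  # prints 0.541%
```

The result agrees with the 0.541% increment reported for the low-level case.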
Next, we consider how the optimal treatment protocols are affected by
the protection level Γ and the initial tumor size. Figure 10a–c shows
the optimal treatment protocols for $\hat{C}_j = 0.2 L_j$ and initial
tumor size at low (a), medium (b), and high (c) levels. For initial
tumor size at low level, as wild-type cells dominate the total tumor
size at the beginning of treatment, it is efficient to reduce tumor
size by taking the drug with the lowest toxicity, which is drug 3.
Recall that drugs 0, 1, 2, and 3 represent drug holiday, nilotinib,
dasatinib, and imatinib, respectively. At the end of treatment, as the
number of mutant cells increases, it is necessary to switch to
dasatinib, which can reduce the number of Y253F cells efficiently. As
we increase the protection level, more drug holidays are needed. For
instance, for the unprotected optimal solution (Γ = 0), the third
break happens at the end of treatment, week 30, whereas for the
conservative solution (Γ = 3), it happens at week 25. Furthermore, as
the protection level increases, nilotinib and dasatinib are more
frequently used in the treatment to guarantee that optimal solutions
do not violate the toxicity constraint under different realizations of
model parameters.
Fig. 8 Robust optimal solutions under $\hat{C}_j = 0.2 L_j$ for 30 weeks. The initial conditions for NSC, WSC,
Y253F, and F317L are 9E+06, 9E+05, 1E+05, and 1E+05, respectively.
Table 3
Robust solution for $\hat{C}_j = 0.2 L_j$: Multiple mutants

        (a) Low level           (b) Medium level        (c) High level
Γ       Optimal value Increment Optimal value Increment Optimal value Increment
        (×10^7)       (%)       (×10^7)       (%)       (×10^7)       (%)
0       2.41632       0         2.9166        0         2.7933        0
0.05    2.41632       0         2.9192        0.0877    2.7957        0.0842
0.1     2.41750       0.049     2.9203        0.1250    2.7957        0.0842
0.2     2.41900       0.110     2.9261        0.3240    2.8004        0.2542
0.3     2.41948       0.131     2.9262        0.3287    2.8035        0.3626
0.4     2.42093       0.191     2.9265        0.3379    2.8061        0.4560
0.5     2.42367       0.304     2.9311        0.4940    2.8084        0.5386
1       2.42939       0.541     2.9403        0.8098    2.8288        1.2707
For patients with an initial tumor size at medium level, as the
mutant cell population increases, nilotinib and dasatinib appear
more often in the optimal solution during the course of therapy.
Similarly, patients with an initial tumor size at medium level also
need to take longer breaks as we increase the protection level of the
toxicity constraint.
Fig. 9 Tumor size increments of optimal values under Γ with three different initial conditions
Fig. 10 Optimal solutions under $\hat{C}_j = 0.2 L_j$ with three different initial conditions: (a) initial tumor size at low
level; (b) initial tumor size at medium level; (c) initial tumor size at high level
For patients with initial tumor size at high level, the populations
of mutant cells dominate the tumor sizes. For the first 15 weeks,
nilotinib is delivered to reduce the population size of F317L. Note
that the population size of Y253F keeps increasing during the first
15 weeks due to its resistance to nilotinib. To control the size of
Y253F, dasatinib is administered during the last 15 weeks.
To investigate the effects of the size of the uncertainty ranges
on the optimal solutions, we perform simulation studies to see how
the structure of optimal schedules changes in the context of various
uncertainty ranges. Table 4 shows the robust optimal values under
$\hat{C}_j = 0.3 L_j$ for patients with an initial tumor size at low,
medium, and high levels. Figure 11 shows the optimal solutions for
patients with an initial tumor size at low, medium, and high levels,
respectively. These results are similar to those of
$\hat{C}_j = 0.2 L_j$. Hence, we can conclude that the structure of
the optimal solution is only mildly sensitive to the size of the
uncertainty range.
Table 4
Robust solution for $\hat{C}_j = 0.3 L_j$: Multiple mutants

        (a) Low level           (b) Medium level        (c) High level
Γ       Optimal value Increment Optimal value Increment Optimal value Increment
        (×10^7)       (%)       (×10^7)       (%)       (×10^7)       (%)
0       2.4163        0         2.9167        0         2.7933        0
0.05    2.4178        0.0604    2.9197        0.1036    2.7957        0.0842
0.1     2.4178        0.0604    2.9220        0.1840    2.8000        0.2397
0.2     2.4221        0.2386    2.9251        0.2913    2.8050        0.4176
0.3     2.4222        0.2448    2.9281        0.3912    2.8102        0.6044
0.4     2.4260        0.4018    2.9313        0.5033    2.8107        0.6217
0.5     2.4289        0.5223    2.9341        0.5980    2.8219        1.0237
1       2.4366        0.8394    2.9524        1.2242    2.8443        1.8250
Fig. 11 Optimal solutions under $\hat{C}_j = 0.3 L_j$ with three different initial conditions: (a) initial tumor size at low
Next, we focus on comparing the differences that result from the
uncertainty ranges $\hat{C}_j = 0.2 L_j$ and $\hat{C}_j = 0.3 L_j$.
From Fig. 12, we observe that, for patients with an initial tumor size
at low, medium, and high levels, the larger the toxicity uncertainty
range, the larger the optimal value.
Fig. 12 The effects of uncertainty ranges under three different initial conditions: (a) initial tumor size at low
The idea of imposing protection levels on robust optimization
is to use conservative constraints that guarantee no toxic side effects
occur. Here, we compare the performance of nominal solutions
versus robust optimization solutions in terms of objective function
and toxic side effects. We do this by randomly generating ANC
decay rates and comparing the performance of the robust and nominal
optimal solutions for the generated decay rates. We do this
repeatedly and look at the average increase in leukemic cell burden
that results when using the robust optimal schedule; we also look at
the fraction of times we have toxic side effects when using the
nominal optimal solution. This process of repeatedly generating
random variables and averaging results is known as Monte Carlo
simulation. To summarize, we use Monte Carlo simulation to understand
how much greater the cumulative tumor size is under robust
optimization to guarantee that patients will not show toxic side
effects, and how much more toxicity (in terms of decreasing ANC)
patients will suffer if they are treated with the nominal therapy. Two
sets of simulations are conducted, corresponding to different
uncertainty ranges ($\hat{C}_j = 0.2 L_j$ and $\hat{C}_j = 0.3 L_j$).
For both simulations, the initial populations for NSC, WSC, Y253F, and
F317L are 9E+06, 5E+05, 3E+05, and 3E+05, respectively. The ANC
decrease rate $d^m_{\mathrm{anc},j}$ under drug j at the mth week is
randomly generated by assuming $\eta_{m,j}$ is uniformly distributed
on [0, 1]. If the ANC value is below $L_{\mathrm{anc}}$ during the
simulation, then the patient is considered to have experienced a toxic
side effect and the simulation is considered infeasible. The fraction
of cases that are infeasible due to toxic side effects is calculated
as the total number of infeasible cases divided by the total number of
cases ($10^6$). The results are shown in Table 5.
Recall that Γ = 0 is equivalent to the nominal problem. For both
simulations, as we increase Γ the objective value increases, while the
probability of toxicity violation decreases. The optimal solution
obtained by ROP yields a tradeoff between the two objectives of
minimizing the cumulative tumor population and avoiding infeasibility
of the toxicity constraints; beyond a certain point, allowing riskier
regimens, i.e., using smaller Γ, does not lead to any significant gain
in the objective function. In particular, if $\hat{C}_j = 0.2 L_j$,
then it appears that around Γ = 0.2 there is a sharp change in the
fraction of runs that lead to toxic side effects and a significant
increase in objective value.
Table 5
Price of robust optimization

        $\hat{C}_j = 0.2 L_j$                 $\hat{C}_j = 0.3 L_j$
Γ       Increments   Toxicity                 Increments   Toxicity
        in OBJ (%)   invalidation (%)         in OBJ (%)   invalidation (%)
1       1.2707       0                        1.8250       0
0.5     0.5386       49.98                    1.0237       48.64
0.4     0.4560       61.80                    0.6217       61.89
0.3     0.3626       69.18                    0.6044       72.31
0.2     0.2542       79.07                    0.4176       82.67
0.1     0.0842       94.08                    0.2397       96.53
0.05    0.0842       94.08                    0.0842       97.38
0       0            100                      0            100
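The Monte Carlo procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: we assume the weekly ANC decrease takes the form $L_j + \hat{C}_j \eta$ with $\eta$ uniform on [0, 1], we treat the drug holiday (drug 0) as having no uncertainty, and all numerical values (rates, ANC bounds, schedule) are hypothetical rather than taken from the chapter.

```python
import random

def toxicity_violation_fraction(schedule, nominal_rate, rel_uncertainty,
                                y0, anc_normal, l_anc, n_runs=2000, seed=0):
    """Fraction of simulated patients whose ANC drops below l_anc.

    Assumed weekly update: y_{m+1} = min(y_m, anc_normal) - d, where
    d = L_j + C_hat_j * eta, C_hat_j = rel_uncertainty * L_j, eta ~ U[0, 1].
    """
    rng = random.Random(seed)
    violations = 0
    for _ in range(n_runs):
        # Draw every eta up front so each run consumes the same random numbers.
        etas = [rng.random() for _ in schedule]
        y = y0
        violated = False
        for drug, eta in zip(schedule, etas):
            y_hat = min(y, anc_normal)
            c_hat = rel_uncertainty * nominal_rate[drug] if drug != 0 else 0.0
            y = y_hat - (nominal_rate[drug] + c_hat * eta)
            if min(y, anc_normal) < l_anc:
                violated = True
                break
        violations += violated
    return violations / n_runs

# Hypothetical inputs (not from the chapter).
NOMINAL_RATE = {0: -0.8, 1: 0.4, 2: 0.5, 3: 0.3}  # L_j; negative means recovery
SCHEDULE = [1, 1, 0] * 10                          # 30 weekly drug choices

for rel in (0.0, 0.2, 0.3):
    frac = toxicity_violation_fraction(SCHEDULE, NOMINAL_RATE, rel,
                                       y0=3.0, anc_normal=3.0, l_anc=1.0)
    print(f"C_hat = {rel:.1f} * L: violation fraction = {frac:.3f}")
```

Because the same random draws are reused for every uncertainty range, the estimated violation fraction of a fixed schedule cannot decrease as the range widens, which mirrors the qualitative trend in Table 5.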
5 Conclusion
The major focus of this chapter was to introduce a mathematical

model for identifying optimal anti-cancer treatment strategies in the
presence of parameter uncertainty. These methods have great poten-
tial for designing and understanding optimal anti-cancer treatments.
Our general framework is to build a differential equation model for
the relevant cancer and normal cell populations undergoing a partic-
ular treatment. For many differential equations, it is necessary to
develop a linear approximation via a linear regression model. We next
develop a mathematical model for the most relevant toxicities in the
treatment we are studying. With these mathematical models in place,
we are able to build a mathematical optimization model for identifying
optimal treatment schedules. In order to account for patient
variability, we make our model robust to inter-patient heterogeneity
in the rate at which side effects develop. We can then use software
solvers to identify treatment schedules that are optimal and robust.
In this chapter, we applied this methodology to study the treatment
of chronic myeloid leukemia (CML) with a variety of possible tyro-
sine kinase inhibitors (TKI).
There are several areas for improvement in our method. First,
we assume the drug dosages are constant, and therefore ignore the
possible benefits or risks of varying doses. Second, we assume that
there is no drug present in the patient from the previous treatment
when we switch to a new drug. In order to better characterize this
residual term, a detailed analysis of drug-drug interactions would be
necessary. Third, our method only accounts for inter-patient
variability in toxicity terms and not in cancer cell growth or death
rates. This is an important aspect of inter-patient variability that
we plan to pursue further. Finally, our method requires that we
approximate the governing differential equations with a model that is
linear in the state. This prevents us from finding the true optimal
solution; furthermore, this approximation can be problematic in
systems that exhibit strongly nonlinear behaviors.
Acknowledgments
JZ was supported by NSF grant DMS-1224362. HB was supported by
NSF grant CMMI-1362236. KL was supported by NSF grants
CMMI-1362236 and CMMI-1552764.
Appendix 1: Nominal Optimization Problem
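The nominal optimization problem can be formulated as a MIOP
as below:

\begin{align}
\min \quad & \sum_{m \in \mathcal{M},\; i \in \mathcal{I} \setminus \{1\}} x_{i,m} \tag{7a} \\
\text{s.t.} \quad & \dot{x}_1(t) = \sum_{j=0}^{3} z_{m,j} \left( b_1^j \psi_{x_1} - d \right) x_1, \quad t \in [m\Delta t, (m+1)\Delta t],\; m \in \mathcal{M} \setminus \{M\}, \tag{7b} \\
& \dot{x}_2(t) = \sum_{j=0}^{3} z_{m,j} \left( b_2^j \left(1 - (n-2)\mu\right) \psi_{x_2} - d \right) x_2, \quad t \in [m\Delta t, (m+1)\Delta t],\; m \in \mathcal{M} \setminus \{M\}, \tag{7c} \\
& \dot{x}_i(t) = \sum_{j=0}^{3} z_{m,j} \left[ \left( b_i^j \psi_{x_2} - d \right) x_i + \mu b_2^j \psi_{x_2} x_2 \right], \quad t \in [m\Delta t, (m+1)\Delta t],\; m \in \mathcal{M} \setminus \{M\},\; 3 \le i \le n, \tag{7d} \\
& \sum_{j \in \mathcal{J}} z_{m,j} = 1, \quad m \in \mathcal{M} \setminus \{M\}, \tag{7e} \\
& y_{m+1} = \hat{y}_m - \sum_{j \in \mathcal{J}} d_{\mathrm{anc},j}\, z_{m,j}, \quad m \in \mathcal{M} \setminus \{M\}, \tag{7f} \\
& \hat{y}_m = \min(y_m, \mathrm{ANC}_{\mathrm{normal}}), \quad m \in \mathcal{M}, \tag{7g} \\
& L_{\mathrm{anc}} \le \hat{y}_m, \quad m \in \mathcal{M}, \tag{7h} \\
& z_{m,j} \in \{0, 1\}, \quad m \in \mathcal{M} \setminus \{M\},\; j \in \mathcal{J}, \tag{7i}
\end{align}

where $x(0)$ and $y_0$ are given. Equations 7b, 7c, and 7d describe the
dynamics of NSC, WSC, and MSC, respectively. Equations 7e and 7i
indicate that during each week, only one type of drug or no drug is
allowed. Equations 7f, 7g, and 7h describe the toxicity constraints.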
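As discussed in the previous work [26], the ODEs can be
approximated by linear functions:

\begin{align*}
\min \quad & \sum_{m \in \mathcal{M},\; i \in \mathcal{I} \setminus \{1\}} x_{i,m} \\
\text{s.t.} \quad & x_{i,m+1} = \sum_{j=0}^{3} z_{m,j} \left( C^j_{i,0} + \sum_{k=1}^{n} C^j_{i,k}\, x_{k,m} \right), \quad t \in [m\Delta t, (m+1)\Delta t],\; m \in \mathcal{M} \setminus \{M\}, \\
& \sum_{j \in \mathcal{J}} z_{m,j} = 1, \quad m \in \mathcal{M} \setminus \{M\}, \\
& y_{m+1} = \hat{y}_m - \sum_{j \in \mathcal{J}} d_{\mathrm{anc},j}\, z_{m,j}, \quad m \in \mathcal{M} \setminus \{M\}, \\
& \hat{y}_m = \min(y_m, \mathrm{ANC}_{\mathrm{normal}}), \quad m \in \mathcal{M}, \\
& L_{\mathrm{anc}} \le \hat{y}_m, \quad m \in \mathcal{M}, \\
& z_{m,j} \in \{0, 1\}, \quad m \in \mathcal{M} \setminus \{M\},\; j \in \mathcal{J},
\end{align*}

where $x(0)$ and $y_0$ are given.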
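To make the role of the linear approximation concrete, the sketch below evaluates the objective for one fixed drug schedule by iterating the linear update for $x_{i,m+1}$. It is an illustration only: the coefficient values are hypothetical, cell types are indexed from 0 in the code (so index 0 is the normal stem cell population excluded from the objective), and we assume the sum in the objective includes the initial week.

```python
def cumulative_leukemic_burden(x0, schedule, c0, c):
    """Objective of the linearized model for a fixed weekly drug schedule.

    x0       : initial populations; x0[0] is the normal stem cell population
    schedule : drug index j chosen in each week m
    c0[j][i] : intercept C^j_{i,0};  c[j][i][k] : coefficient C^j_{i,k}
    """
    x = list(x0)
    total = sum(x[1:])                       # initial week, normal cells excluded
    for j in schedule:
        x = [c0[j][i] + sum(c[j][i][k] * x[k] for k in range(len(x)))
             for i in range(len(x))]
        total += sum(x[1:])
    return total

# Hypothetical coefficients: drug 0 is a holiday, drug 1 halves the leukemic cells.
C0 = {0: [0.0, 0.0], 1: [0.0, 0.0]}
C = {0: [[1.0, 0.0], [0.0, 1.1]],
     1: [[1.0, 0.0], [0.0, 0.5]]}

print(cumulative_leukemic_burden([100.0, 8.0], [1, 1, 0], C0, C))
```

With these toy numbers the leukemic population goes 8, 4, 2, 2.2 over the three weeks, so the cumulative burden is their sum. The optimization problem searches over all schedules for the one that minimizes this quantity subject to the toxicity constraints.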

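There are two types of nonlinear terms here: $z_{m,j}\, x_{i,m}$ and
$\hat{y}_m = \min(y_m, \mathrm{ANC}_{\mathrm{normal}})$.
To linearize $z_{m,j}\, x_{i,m}$, we introduce a new variable
$v_i^{m,j}$:

\begin{align*}
& 0 \le v_i^{m,j} \le U_i\, z_{m,j}, \\
& -U_i (1 - z_{m,j}) \le v_i^{m,j} - x_{i,m} \le U_i (1 - z_{m,j}).
\end{align*}

To linearize $\hat{y}_m = \min(y_m, \mathrm{ANC}_{\mathrm{normal}})$, we
introduce a binary variable $p_m$:

\begin{align*}
& \hat{y}_m \ge \mathrm{ANC}_{\mathrm{normal}} - U_y (1 - p_m), \\
& \hat{y}_m \ge y_m - U_y\, p_m, \\
& \hat{y}_m \le y_m, \\
& \hat{y}_m \le \mathrm{ANC}_{\mathrm{normal}}, \\
& p_m \in \{0, 1\}.
\end{align*}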
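The two linearizations can be checked by brute force on a small integer grid. The sketch below assumes that $U_i$ and $U_y$ are valid upper bounds on $x_{i,m}$ and $y_m$ (a standard big-M requirement that we state here as an assumption), and it uses hypothetical values for those bounds and for the normal ANC level.

```python
from itertools import product

def min_constraints_hold(y_hat, y, p, anc_normal, u_y):
    """Big-M constraints intended to force y_hat == min(y, anc_normal)."""
    return (y_hat >= anc_normal - u_y * (1 - p)
            and y_hat >= y - u_y * p
            and y_hat <= y
            and y_hat <= anc_normal)

def product_constraints_hold(v, x, z, u):
    """Big-M constraints intended to force v == z * x for binary z, 0 <= x <= u."""
    return (0 <= v <= u * z) and (-u * (1 - z) <= v - x <= u * (1 - z))

ANC_NORMAL, U_Y, U_X = 3, 10, 10   # hypothetical integer values for the check
GRID = range(0, 11)

# y_hat is feasible for some binary p exactly when it equals min(y, ANC_NORMAL).
for y, y_hat in product(GRID, GRID):
    feasible = any(min_constraints_hold(y_hat, y, p, ANC_NORMAL, U_Y) for p in (0, 1))
    assert feasible == (y_hat == min(y, ANC_NORMAL))

# v is feasible exactly when it equals the product z * x.
for x, v, z in product(GRID, GRID, (0, 1)):
    assert product_constraints_hold(v, x, z, U_X) == (v == z * x)

print("big-M constraints reproduce min(.) and z*x on the test grid")
```

If the bounds were chosen too small (for example `U_Y` below the largest possible `y`), the first loop would fail, which illustrates why the big-M constants must dominate the variables they multiply.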
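The nominal problem can be transformed into a MILP as

\begin{align}
\min \quad & \sum_{m \in \mathcal{M},\; i \in \mathcal{I} \setminus \{1\}} x_{i,m} \tag{8a} \\
\text{s.t.} \quad & x_{i,m+1} = \sum_{j=0}^{3} \left( z_{m,j}\, C^j_{i,0} + \sum_{k=1}^{n} C^j_{i,k}\, v_k^{m,j} \right), \quad t \in [m\Delta t, (m+1)\Delta t],\; m \in \mathcal{M} \setminus \{M\}, \tag{8b} \\
& 0 \le v_i^{m,j} \le U_i\, z_{m,j}, \tag{8c} \\
& -U_i (1 - z_{m,j}) \le v_i^{m,j} - x_{i,m}, \tag{8d} \\
& v_i^{m,j} - x_{i,m} \le U_i (1 - z_{m,j}), \tag{8e} \\
& \sum_{j \in \mathcal{J}} z_{m,j} = 1, \quad m \in \mathcal{M} \setminus \{M\}, \tag{8f} \\
& y_{m+1} = \hat{y}_m - \sum_{j \in \mathcal{J}} d_{\mathrm{anc},j}\, z_{m,j}, \quad m \in \mathcal{M} \setminus \{M\}, \tag{8g} \\
& \hat{y}_m \ge \mathrm{ANC}_{\mathrm{normal}} - U_y (1 - p_m), \tag{8h} \\
& \hat{y}_m \ge y_m - U_y\, p_m, \tag{8i} \\
& \hat{y}_m \le y_m, \tag{8j} \\
& \hat{y}_m \le \mathrm{ANC}_{\mathrm{normal}}, \tag{8k} \\
& p_m \in \{0, 1\}, \tag{8l} \\
& L_{\mathrm{anc}} \le \hat{y}_m, \quad m \in \mathcal{M}, \tag{8m} \\
& z_{m,j} \in \{0, 1\}, \quad m \in \mathcal{M} \setminus \{M\},\; j \in \mathcal{J}, \tag{8n}
\end{align}

where $x(0)$ and $y_0$ are given.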
Appendix 2: ROP Model
In this section, we describe the mathematical details of the robust
problem. We introduce parameters $\Gamma_m$ that take values in the
bounded intervals $[0, |V_m|]$, where $V_m$ is the index set of
parameters with uncertainty. $\Gamma_m$ is not necessarily an integer.
The role of the parameter $\Gamma_m$ is to adjust the robustness of
the proposed model against the level of conservatism of the solution;
thus it is called the protection level. The motivation for $\Gamma_m$
is that it is unlikely that all the parameters with uncertainty vary
at the same time and reach the maximal uncertainty. In other words,
the model assumes that only a subset of the parameters drift and
influence the solution. More specifically, it assumes that up to
$\lfloor \Gamma_m \rfloor$ of the uncertain parameters are allowed to
deviate from their nominal values, and the toxicity decrease rate
$d^m_{\mathrm{anc},j}$ changes by at most
$(\Gamma_m - \lfloor \Gamma_m \rfloor)\, \hat{C}_j$, where
$\lfloor \Gamma_m \rfloor$ is the greatest integer $\le \Gamma_m$.
Note that if we choose $\Gamma_m = 0$, we completely ignore the
influence of parameter uncertainty and use the nominal values of the
uncertain parameters, and if we choose $\Gamma_m = |V_m|$, then all
the uncertain parameters are allowed to deviate from their nominal
values. In this project, the maximum value of $\Gamma_m$ is 3 since
there are three parameters with uncertainty. Note, however, that
because only one drug is chosen for each period, the parameter
uncertainty of the other two drugs will not affect the robust optimal
solution. Thus, the robust optimal solution under $\Gamma_m = 1$ is
exactly the same as the ones under $\Gamma_m > 1$. The proposed robust
counterpart of problem (Eq. 6) is as follows:
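
\begin{align}
\min \quad & \sum_{m \in \mathcal{M},\; i \in \mathcal{I} \setminus \{1\}} x_{i,m} \tag{9a} \\
\text{s.t.} \quad & y_{m+1} - \hat{y}_m + \sum_{j \in \mathcal{J}} L_j\, z_{m,j} + \max_{C_m^{RO}} \left\{ \sum_{j \in S_m} \hat{C}_j\, z_{m,j} + (\Gamma_m - \lfloor \Gamma_m \rfloor)\, \hat{C}_{t_m}\, z_{m,t_m} \right\} \le 0, \tag{9b} \\
& \text{(8b), (8c), (8d), (8e), (8f), (8h), (8i), (8j), (8k), (8l), (8m), (8n)}, \tag{9c}
\end{align}

where $C_m^{RO} = \{ S_m \cup \{t_m\} \mid S_m \subseteq V_m,\; |S_m| \le \lfloor \Gamma_m \rfloor,\; t_m \in V_m \setminus S_m \}$,
and $S_m$ is the index set of uncertain parameters which are allowed to
deviate from their nominal values. According to the method developed
in [20], the maximization problem in Eq. 9b is equivalent to
the following auxiliary problem: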
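
\begin{align}
\max \quad & \sum_{j \in \mathcal{J}} \hat{C}_j\, \eta_{m,j}\, z_{m,j} \tag{10a} \\
\text{s.t.} \quad & 0 \le \eta_{m,j} \le 1, \quad j \in \mathcal{J}, \tag{10b} \\
& \sum_{j \in \mathcal{J}} \eta_{m,j} \le \Gamma_m. \tag{10c}
\end{align}

Equation 10c indicates that the total variation of the parameters
cannot exceed the threshold $\Gamma_m$. Notice that problem (Eq. 10) is
bounded, and it is clear that $\eta_{m,j} = 0$ is a feasible solution
of (Eq. 10). By strong duality, the optimal objective value of problem
(Eq. 10) is the same as the optimal objective value of its dual
problem. It is easy to check that the dual problem can be written as

\begin{align}
\min \quad & q_m \Gamma_m + \sum_{j \in \mathcal{J}} p_{m,j} \tag{11a} \\
\text{s.t.} \quad & -\hat{C}_j\, z_{m,j} + q_m + p_{m,j} \ge 0, \tag{11b} \\
& q_m \ge 0, \tag{11c} \\
& p_{m,j} \ge 0. \tag{11d}
\end{align}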
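The strong-duality step can be checked numerically without an LP solver, because both problems have closed-form optima when the schedule is fixed. In the sketch below the list `c` stands for the fixed products $\hat{C}_j z_{m,j}$ (hypothetical values, assumed nonnegative). The primal is solved greedily by assigning the budget $\Gamma_m$ to the largest deviations first, and the dual is minimized over its breakpoints in $q_m$.

```python
def primal_value(c, gamma):
    """max sum_j c_j * eta_j  s.t. 0 <= eta_j <= 1, sum_j eta_j <= gamma."""
    remaining, total = gamma, 0.0
    for cj in sorted(c, reverse=True):       # spend the budget on the largest c_j first
        take = min(1.0, remaining)
        if take <= 0:
            break
        total += take * cj
        remaining -= take
    return total

def dual_value(c, gamma):
    """min q * gamma + sum_j p_j  s.t. q + p_j >= c_j, q >= 0, p_j >= 0."""
    # For fixed q the best p_j is max(0, c_j - q). The resulting objective is
    # piecewise linear and convex in q, so a minimizer lies at q = 0 or q = c_j.
    candidates = [0.0] + list(c)
    return min(q * gamma + sum(max(0.0, cj - q) for cj in c) for q in candidates)

# Hypothetical deviations C_hat_j * z_{m,j} for three uncertain parameters.
c = [0.08, 0.10, 0.06]
for gamma in (0.0, 0.5, 1.0, 1.5, 3.0):
    assert abs(primal_value(c, gamma) - dual_value(c, gamma)) < 1e-9
    print(gamma, round(primal_value(c, gamma), 4))
```

For a fractional budget such as $\Gamma_m = 1.5$, the primal takes the largest deviation in full and half of the second largest, which is exactly the $\lfloor \Gamma_m \rfloor$ plus $(\Gamma_m - \lfloor \Gamma_m \rfloor)$ structure of Eq. 9b.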
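Thus, the optimal solution of our robust problem can be
obtained by solving the MILP:

\begin{align}
\min \quad & \sum_{m \in \mathcal{M},\; i \in \mathcal{I} \setminus \{1\}} x_{i,m} \tag{12a} \\
\text{s.t.} \quad & x_{i,m+1} = \sum_{j=0}^{3} \left( z_{m,j}\, C^j_{i,0} + \sum_{k=1}^{n} C^j_{i,k}\, v_k^{m,j} \right), \quad t \in [m\Delta t, (m+1)\Delta t],\; m \in \mathcal{M} \setminus \{M\}, \tag{12b} \\
& 0 \le v_i^{m,j} \le U_i\, z_{m,j}, \tag{12c} \\
& -U_i (1 - z_{m,j}) \le v_i^{m,j} - x_{i,m}, \tag{12d} \\
& v_i^{m,j} - x_{i,m} \le U_i (1 - z_{m,j}), \tag{12e} \\
& \sum_{j \in \mathcal{J}} z_{m,j} = 1, \quad m \in \mathcal{M} \setminus \{M\}, \tag{12f} \\
& y_{m+1} \le \hat{y}_m - \sum_{j \in \mathcal{J}} L_j\, z_{m,j} - q_m \Gamma_m - \sum_{j \in \mathcal{J}} p_{m,j}, \quad m \in \mathcal{M} \setminus \{M\}, \tag{12g} \\
& \hat{y}_m \ge \mathrm{ANC}_{\mathrm{normal}} - U_y (1 - p_m), \tag{12h} \\
& \hat{y}_m \ge y_m - U_y\, p_m, \tag{12i} \\
& \hat{y}_m \le y_m, \tag{12j} \\
& \hat{y}_m \le \mathrm{ANC}_{\mathrm{normal}}, \tag{12k} \\
& p_m \in \{0, 1\}, \tag{12l} \\
& L_{\mathrm{anc}} \le \hat{y}_m, \quad m \in \mathcal{M}, \tag{12m} \\
& z_{m,j} \in \{0, 1\}, \quad m \in \mathcal{M} \setminus \{M\},\; j \in \mathcal{J}, \tag{12n} \\
& -\hat{C}_j\, z_{m,j} + q_m + p_{m,j} \ge 0, \tag{12o} \\
& q_m \ge 0, \tag{12p} \\
& p_{m,j} \ge 0, \tag{12q} \\
& x(0),\; y_1,\; \hat{y}_1 \text{ are given}, \quad p_1 = 0. \tag{12r}
\end{align}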
Appendix 3: Robust Optimal Solutions for 30 Weeks
In this section, we summarize all the robust optimal solutions
discussed in Subheading 4.4 (see Figs. 13, 14, 15, 16, and 17).
Y253F, and F317L are 9E+06, 9E+05, 1E+04, and 1E+04, respectively
References
1. Shi Z, Peng XX, Kim IW et al (2007) Erlotinib (Tarceva, OSI-774) antagonizes ATP-binding cassette subfamily B member 1 and ATP-binding cassette subfamily G member 2-mediated drug resistance. Cancer Res 67:11012–20
2. Paraiso KH, Xiang Y, Rebecca VW et al (2011) PTEN loss confers BRAF inhibitor resistance to melanoma cells through the suppression of BIM expression. Cancer Res 71:2750–2760
3. Foo J, Michor F (2014) Evolution of acquired resistance to anti-cancer therapy. J Theor Biol 355:10
4. Leder K, Foo J, Skaggs B et al (2011) Fitness conferred by BCR-ABL kinase domain mutations determines the risk of pre-existing resistance in chronic myeloid leukemia. PLoS One 6(11):e27682. https://doi.org/10.1371/journal.pone.0027682
5. Foo J, Leder K (2013) Dynamics of cancer recurrence. Annals Appl Probab 23(4):1437–1468
6. Swanson KR, Bridge C, Murray JD et al (2003) Virtual and real brain tumors: using mathematical modeling to quantify glioma growth and invasion. J Neurol Sci 216(1):1–10
7. Badri H, Watanabe Y, Leder K (2015) Optimal radiotherapy dose schedules under parametric uncertainty. Phys Med Biol 61(1):338
8. Zhou C, Wu YL, Chen G et al (2011) Erlotinib versus chemotherapy as first-line treatment for patients with advanced EGFR mutation-positive non-small-cell lung cancer (OPTIMAL, CTONG-0802): a multicentre, open-label, randomised, phase 3 study. Lancet Oncol 12(8):735–742
9. Druker BJ, Talpaz M, Resta DJ et al (2001) Efficacy and safety of a specific inhibitor of the BCR-ABL tyrosine kinase in chronic myeloid leukemia. N Engl J Med 344:1031–1037
10. Kantarjian H, Sawyers C, Hochhaus A et al (2002) Hematologic and cytogenetic responses to imatinib mesylate in chronic myelogenous leukemia. N Engl J Med 346:645–652
11. Cortes JE, Jones D, O’Brien S et al (2010) Results of dasatinib in patients with early chronic-phase chronic myeloid leukemia. J Clin Oncol 28(3):398–404
12. Giles FJ, Abruzzese E, Rosti G et al (2010) Nilotinib is active in chronic and accelerated phase chronic myeloid leukemia following failure of imatinib and dasatinib therapy. Leukemia 24:1299–1301
13. O’Hare T, Eide CA, Deininger MWN (2007) Bcr-Abl kinase domain mutations, drug resistance, and the road to a cure for chronic myeloid leukemia. Blood 110:2242–2249
14. He Q, Zhu JF, Dingli D et al (2016) Optimized treatment schedules for chronic myeloid leukemia. PLoS Comput Biol 12:e1005129
15. Harrold JM, Parker RS (2009) Clinically relevant cancer chemotherapy dose scheduling via mixed-integer optimization. Comput Chem Eng 33(12):2042–2054
16. Murray JM (1990) Some optimal control problems in cancer chemotherapy with a toxicity limit. Math Biosci 100(1):49–67
17. Murray JM (1990) Optimal control for a cancer chemotherapy problem with general growth and loss functions. Math Biosci 98:273–287
18. Hadjiandreou MM, Mitsis GG (2014) Mathematical modeling of tumor growth, drug-resistance, toxicity, and optimal therapy design. IEEE Trans Biomed Eng 61(2):415–425
19. Laird AK (1964) Dynamics of tumour growth. Br J Cancer 18(3):490–502
20. Martin RB (1992) Optimal control drug scheduling of cancer chemotherapy. Automatica 28:1113–1123
21. Floares A Neural networks control of drug dosage regimens in cancer chemotherapy. SAIA, Cluj-Napoca, Transilvania
22. Weisstein, Eric W. Heaviside step function. MathWorld
23. Afenya EK (2001) Recovery of normal hemopoiesis in disseminated cancer therapy-a model. Math Biosci 172
24. Michor F, Hughes TP, Iwasa Y et al (2005) Dynamics of chronic myeloid leukaemia. Nature 435:1267–1270
25. Bozic I, Reiter JG, Allen B et al (2013) Evolutionary dynamics of cancer in response to targeted combination therapy. Elife 2:e00747
26. Nanda S, Moore H, Lenhart S (2007) Optimal control of treatment in a mathematical model of chronic myelogenous leukemia. Math Biosci 210:143
27. O’Brien S, Berman E, Borghaei H et al (2009) NCCN clinical practice guidelines in oncology: chronic myelogenous leukemia. J Natl Compr Canc Netw 7(9):984–1023
28. Sokal JE, Cox EB, Baccarani M et al (1984) Prognostic discrimination in “good-risk” chronic granulocytic leukemia. Blood 63:789–799
29. Hasford J, Pfirrmann M, Hehlmann R et al (1998) A new prognostic score for survival of patients with chronic myeloid leukemia treated with interferon alfa. Writing Committee for the Collaborative CML Prognostic Factors Project Group. J Natl Cancer Inst 90:850–858
30. Scheijen B, Griffin JD (2002) Tyrosine kinase oncogenes in normal hematopoiesis and hematological disease. Oncogene 21:3314
31. Deininger MW, O’Brien S, Ford JM et al (2003) Practical management of patients with chronic myeloid leukemia receiving imatinib. J Clin Oncol 21(8):1637–1647
32. Katia BBP, Israel B, Carla B et al (2015) BCR-ABL mutations in Chronic Myeloid Leukemia treated with tyrosine kinase inhibitors and impact on survival. Cancer Invest 33:451–458
33. Ravin JG, Hagop K, Susan O et al (2009) The use of nilotinib or dasatinib after failure to 2 prior tyrosine kinase inhibitors: long-term follow-up. Blood 114(20):4361
34. Wei G, Rafiyath S, Liu D (2010) First-line treatment for chronic myeloid leukemia: dasatinib, nilotinib, or imatinib. J Hematol Oncol 3:47
35. Cornelison M, Jabbour EJ, Welch MA (2012) Managing side effects of tyrosine kinase inhibitor therapy to optimize adherence in patients with chronic myeloid leukemia: the role of the midlevel practitioner. J Support Oncol 10(1):14–24
36. Conchon M, Freitas CM, Rego MA et al (2011) Dasatinib - clinical trials and management of adverse events in imatinib resistant/intolerant chronic myeloid leukemia. Rev Bras Hematol Hemoter 33(2):131–139
37. Marin D (2012) Initial choice of therapy among plenty for newly diagnosed chronic myeloid leukemia. Hematology Am Soc Hematol Educ Program 1:115–121
38. Foo J, Drummond MW, Clarkson B et al (2009) Eradication of chronic myeloid leukemia stem cells: a novel mathematical model predicts no therapeutic benefit of adding G-CSF to imatinib. PLoS Comput Biol 5(9):e1000503
39. Gruber FX, Ernst T, Porkka K et al (2012) Dynamics of the emergence of dasatinib and nilotinib resistance in imatinib-resistant CML patients. Leukemia 26:172–177
40. Cortes JE, Jones D, O’Brien S et al (2010) Nilotinib as front-line treatment for patients with chronic myeloid leukemia in early chronic phase. J Clin Oncol 28(3):392–397
41. Radich JP, Kopecky KJ, Appelbaum FR et al (2012) A randomized trial of dasatinib 100 mg versus imatinib 400 mg in newly diagnosed chronic-phase chronic myeloid leukemia. Blood 120(19):3898–3905
42. Deininger M, Mauro M, Matloub Y et al (2008) Prevalence of T315I, dasatinib-specific resistant mutations (F317L, V299L, and T315A), and nilotinib-specific resistant mutations (P-loop and F359) at the time of imatinib resistance in chronic-phase chronic myeloid leukemia (CP-CML). Blood 112:3236
43. Sawyers C (2004) Targeted cancer therapy. Nature 432:294–297
44. Cortes J, Talpaz M, O’Brien S et al (2005) Molecular responses in patients with chronic myelogenous leukemia in chronic phase treated with imatinib mesylate. Clin Cancer Res 11:3425
Chapter 16
Modeling of Interactions between Cancer Stem Cells
and their Microenvironment: Predicting Clinical Response
Mary E. Sehl and Max S. Wicha
Abstract
Mathematical models of cancer stem cells are useful in translational cancer research for facilitating the
understanding of tumor growth dynamics and for predicting treatment response and resistance to com-
bined targeted therapies. In this chapter, we describe appealing aspects of different methods used in
mathematical oncology and discuss compelling questions in oncology that can be addressed with these
modeling techniques. We describe a simplified version of a model of the breast cancer stem cell niche,
illustrate the visualization of the model, and apply stochastic simulation to generate full distributions and
average trajectories of cell type populations over time. We further discuss the advent of single-cell data in
studying cancer stem cell heterogeneity and how these data can be integrated with modeling to advance
understanding of the dynamics of invasive and proliferative populations during cancer progression and
response to therapy.
Key words Breast cancer, Cancer stem cell, Mathematical model, Optimal therapy design
1 Introduction
Mathematical modeling of cancer stem cells has proven useful in

several important areas of translational cancer research. Those
include, for example: understanding evolutionary dynamics of
clonal populations and prediction of therapeutic resistance [1–3];
understanding tumor growth dynamics [4]; inferring the evolu-
tionary dynamics that occur during cancer initiation and progres-
sion [5]; understanding the dynamics of stem cell state transitions
and estimation of dedifferentiation rates [6, 7]; understanding the
complex regulatory pathways that modulate stem cell behavior; and
predicting clinical responses to combination therapies targeting the
cancer stem cell niche [7].
Based on predictions from modeling, clinical oncologists are
able to optimize dosing, frequency, and duration of therapies (e.g.,
dose dense treatments in adjuvant breast cancer therapies), which
increase efficacy and minimize side effects, leading to improved
outcomes [8–10]. Statistical modeling has also proven valuable in

selecting prognostic and predictive markers in clinical trials in
translational oncology [11]. There are many opportunities where
modeling can expand its contribution to translational oncology. As
single-cell transcriptomics and epigenetic data become more readily
available and methods of simulation become more sophisticated,
multiscale modeling will permit the integration of data that will
inform models and improve predictions, which will ultimately lead
to more effective therapies. In this chapter, we will review the
methods that are currently being used in mathematical oncology,
and suggest areas where modeling could further be applied in
cancer stem cell systems biology research.
2 Compelling Research Questions in (Cancer) Stem Cell Research That Can Be
Addressed with Mathematical Modeling
2.1 Single-Cell Gene Expression and Epigenetic Data: How to Extract Information to Best Inform Models?
Single-cell sequencing and transcriptomics on a genome-wide level
have advanced greatly in recent years. Statistical methods have been
developed to analyze single-cell data in order to characterize tumor
heterogeneity [12, 13], demonstrate clonal evolution [14], and
infer phylogenetic relationships and ordering of mutations [15].
Genetic and epigenetic patterns that emerge during the processes
of stem cell quiescence, activation, and differentiation can be
captured using single-cell analysis. Intra-tumoral heterogeneity
creates a challenge for the study of the interconnecting molecular
events that guide these processes. Single-cell gene expression
analysis has been used to explore cell heterogeneity in breast cancer
and unravel gene expression variation in both cell line and
patient-derived xenograft samples [16, 17]. By examining expression
levels of 96 genes from pathways involved in cell self-renewal,
adhesion, and differentiation, three different patterns of expression
in these genes were observed in single cells obtained from cell lines
and from patient-derived xenograft samples. These patterns correspond
to three distinct cell populations: epithelial breast cancer stem
cells (BCSCs), mesenchymal BCSCs, and non-stem cancer cells. Applying
these methods to populations of circulating tumor cells will
allow for the characterization of cell types within a patient at
diagnosis and in response to treatment.
Whole transcriptome RNA-sequencing is used to capture transcriptional
events that are continuously changing within a cell over
time. Changes that are observed using this technology include
alternatively spliced gene transcripts, post-transcriptional
modifications, gene fusion, mutations, and alterations in gene expression.
Additionally, whole genome bisulfite sequencing is used to gener-
ate genome-wide analysis of DNA methylation. As these technolo-
gies become available within single-cell studies, sophisticated
methods will need to be developed to analyze these data and
distinguish relevant patterns from inherent noise that is anticipated

within a single cell over time. Signaling pathways can be recon-
structed from genome, transcriptome, and proteome data
[18, 19]. While statistical inference has been successful in studying
these components individually, combining information from each
level is essential for understanding the system as a whole [20]. As
our understanding of cellular networks improves, these results can
be integrated with dynamic modeling approaches to estimate rates
of stem cell state transitions and to identify regulatory nodes. As
samples from circulating tumor cells from patients exposed to
therapeutic combinations become available, these methods could
be used to sort cell populations and track responses of each popu-
lation to therapy.
2.2 Modeling Cell- Because tumors consist of many cell types that interact with each
Cell Interactions other, as well as with the numerous cell types that are present in the
Between Cancer Stem tumor microenvironment, models that account for these interac-
Cells and Their tions are required. Evolutionary game theory has been useful in
Microenvironment modeling these interactions [1]. Models based on evolutionary
game theory have been employed to examine mechanisms of
growth control under conditions of competing resources [21],
and have predicted the evolution of cooperation among tumor
cells [22].
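As a minimal sketch of how an evolutionary game theory model is set up, the following Python code iterates the two-strategy replicator equation described in [1]. The payoff values, the function names, and the generic strategy labels A and B are illustrative assumptions; they are not taken from the models in [21, 22].

```python
def replicator_step(x, payoff, dt=0.01):
    """One Euler step of the two-strategy replicator equation.

    x is the frequency of strategy A; 1 - x is the frequency of strategy B.
    payoff[i][j] is the payoff to strategy i when it meets strategy j.
    """
    f_a = payoff[0][0] * x + payoff[0][1] * (1.0 - x)  # fitness of A
    f_b = payoff[1][0] * x + payoff[1][1] * (1.0 - x)  # fitness of B
    mean_fitness = x * f_a + (1.0 - x) * f_b
    # dx/dt = x * (f_a - mean fitness)
    return x + dt * x * (f_a - mean_fitness)


def simulate(x0, payoff, steps=20000, dt=0.01):
    """Iterate the replicator equation from an initial frequency x0."""
    x = x0
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
    return x


# Hypothetical payoffs in which each cell type does better when it is rare,
# so the two types coexist at a stable mixed equilibrium.
PAYOFF = [[1.0, 3.0],
          [2.0, 1.0]]

if __name__ == "__main__":
    print(simulate(0.1, PAYOFF), simulate(0.9, PAYOFF))
```

With these assumed payoffs the two fitnesses are equal at x = 2/3, and trajectories started on either side of that value approach it, which illustrates how stable coexistence of interacting cell types can arise from frequency-dependent fitness.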
The breast cancer stem cell microenvironment consists of a
number of diverse cell types including more differentiated tumor
cells, stromal cells, endothelial cells, and immune cells. These cells
interact with each other through a number of signaling mechanisms
involving cytokines, growth factors, and other signaling molecules,
such as miRNAs [23–29].
Under normal conditions, the stem cell niche regulates how
stem cells participate in tissue generation, maintenance, and repair,
preventing stem cell depletion and overpopulation. The interaction
between these normal, tissue-specific stem cells and their niche is
required for balanced tissue maintenance, and aberrant function of
the niche may contribute to malignant transformation.
The cancer stem cell niche plays an important role in the
regulation of tumor growth and metastasis, as well as in modulating
therapeutic response. Here, we will describe the cellular elements of
the breast cancer stem cell niche.
Breast cancer stem cells exist in either a proliferative, epithelial
state characterized by expression of ALDH as well as epithelial
markers such as E-cadherin, or in a quiescent, invasive, mesenchymal
state characterized by expression of CD44 as well as additional
mesenchymal markers such as vimentin, N-cadherin, Twist,
Slug, and Snail [30]. When a BCSC is in the proliferative state, it
can undergo symmetric self-renewal, or asymmetric self-renewal,
giving rise to one identical copy of itself and one bipotent progenitor
cell [31, 32]. Alternatively, it can undergo symmetric
differentiation, generating two bipotent progenitors (see Fig. 1).
Mathematical modeling has shown that slight disruption in the
balance between symmetric self-renewal, asymmetric self-renewal,
and symmetric differentiation can lead to Gompertzian growth
kinetics in tumors [7]. The bipotent progenitors give rise to either
luminal cells or basal cells. These differentiated cancer cells comprise
the bulk of the tumor, and currently most cancer treatment
modalities are focused on this population. Other cells that are present
in the stem cell microenvironment include mesenchymal stem
cells that give rise to and maintain the stroma, endothelial cells that
reside in the tumor vasculature, and various elements of the immune
system. In fact, recent studies have indicated that myeloid-derived
suppressor cells (MDSCs) are able to directly stimulate BCSC self-renewal
through the activation of the Notch pathway [33].
[Figure 1: diagram of the three division types. Symmetric self-renewal: SC → SC + SC. Asymmetric self-renewal: SC → SC + P. Symmetric differentiation: SC → P + P. SC, stem cell; P, progenitor.]
Fig. 1 Types of stem cell division. A stem cell or stem-like cell can undergo symmetric self-renewal, giving
rise to two identical copies of itself, or asymmetric self-renewal, giving rise to one identical copy of itself
and one partially differentiated progenitor cell. It can also undergo symmetric differentiation, in which it gives
rise to two partially differentiated daughter cells
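To see why the balance between the three division types matters, consider a worked equation under simplifying assumptions. The symbols r, p_s, p_a, p_d, and S are introduced here for illustration and are not taken from [7]. Suppose each stem cell divides at rate r and each division is a symmetric self-renewal with probability p_s, an asymmetric self-renewal with probability p_a, or a symmetric differentiation with probability p_d, where p_s + p_a + p_d = 1. The stem cell count changes by +1, 0, or −1 per division, respectively, so the expected number of stem cells S satisfies

$$\frac{dS}{dt} = r\,(p_s - p_d)\,S, \qquad S(t) = S(0)\,e^{r\,(p_s - p_d)\,t}.$$

Homeostasis therefore requires p_s = p_d, and even a small excess of symmetric self-renewal over symmetric differentiation produces sustained expansion of the stem cell compartment. This sketch only illustrates the sensitivity to that balance; it is not the model of [7] and does not by itself reproduce Gompertzian kinetics.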
All of these cell types, as well as the microenvironmental signaling
pathways that guide their interaction, need to be considered in a
multiscale model of the breast cancer stem cell niche. These models
may be helpful in predicting patient responses to combinatorial
therapies that target angiogenesis, promote activation of quiescent
cancer stem cells, and prevent invasion.
2.3 Relevance of Spatial Factors?
Spatial organization is a key factor for growth and tissue renewal
during development and regeneration of healthy tissues [34]. It
was first observed in the germ stem cell niche of Drosophila melanogaster
that during cell division, the mitotic spindle is aligned with
support cells of the niche so that the daughter cell that remains
within the niche retains stem cell identity, whereas the daughter cell
that is displaced outside the niche (away from self-renewal signals)
initiates differentiation [35]. These oriented divisions have also
been observed in mammalian epithelia. For example, the position
of a stem cell within a hair follicle predicts whether it is likely to
remain committed, generate precursors, or progress to a different
fate [34]. Another example is that of stratified epithelial cells.
Alignment of the stem cell niche along a rigid basal lamina leads to
regular morphologies, whereas alignment along a freely moving
basal lamina leads to distorted epithelial morphologies [36].
The dynamics of the stem cell niche have been well described in
the hematopoietic system. Mathematical models designed to explore
the mechanisms by which stem cells communicate with the niche, and
how cancer arises as a result of failure of this communication, have shown
that coupled lineages allow for more controlled regulation of total
blood cell numbers than uncoupled lineages and respond better to
random perturbations to maintain homeostatic equilibrium [37].
In a model of the breast cancer stem cell niche, it would be ideal
to also consider spatial effects. Spatial stochastic models have been
used to study cancer initiation and progression [38] as well as
mutational heterogeneity [39]. Spatial models have the potential
to be helpful for the optimization of therapies targeting the stem
cell niche.
2.4 Do Hypoxic Microenvironments Promote Late Recurrence?
The vasculature of tumors is very important in determining how
nutrients and drugs are delivered to tumor cells. Recent evidence
from mouse xenograft studies demonstrates that hypoxia, mediated
by hypoxia-inducible factor 1α, drives stem/progenitor cell
enrichment and activates the Akt/β-catenin cancer stem cell regulatory
pathway [40]. Hypoxia stimulates ALDH+ epithelial BCSCs,
located in the interior hypoxic zones of breast tumors, while the
invasive mesenchymal cells are located on the leading edge of the
tumor. Models that take into consideration the fractal geometric
properties of tumor vascular networks, as well as the spatial gradi-
ents in resources and metabolic states, have been used to predict
metabolic rates of tumors and derive universal growth curves to
predict growth dynamics in response to targeted treatments
[41]. Extensions of these growth equations including necrotic,
quiescent, and proliferative states have been used to understand
growth trajectories across tumor types. This type of modeling may
be ideally suited to answer questions related to the growth of stem
cell compartments in response to hypoxia, and for the selection of
combined, targeted treatments for the eradication of both quies-
cent and proliferative BCSCs. Another potential option would be
to use recent updates to stochastic simulation methods that include
spatial effects. Introducing the spatial aspects of the stem cell niche
into simulation is required to answer questions related to hypoxic
regulation of BCSC behavior.
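As an illustration of what a universal growth curve looks like, one widely cited form is the ontogenetic growth equation of West and colleagues. Whether the models in [41] use exactly this parameterization is an assumption made here; the symbols m (tumor mass), a, b, M, and m_0 are introduced for this sketch only. Growth is written as a balance between metabolic supply, which scales as the 3/4 power of mass, and maintenance cost, which scales linearly with mass:

$$\frac{dm}{dt} = a\,m^{3/4} - b\,m, \qquad M = \left(\frac{a}{b}\right)^{4},$$

where M is the asymptotic mass at which supply and maintenance balance. Substituting u = (m/M)^{1/4} turns this into a linear equation for u, which gives the closed-form curve

$$\left(\frac{m(t)}{M}\right)^{1/4} = 1 - \left[1 - \left(\frac{m_0}{M}\right)^{1/4}\right] e^{-a t / (4 M^{1/4})}.$$

In such a framework, a treatment that alters vascular supply would act through a, whereas a treatment that increases cell loss would act through b, and extensions with necrotic, quiescent, and proliferative compartments add further terms to this balance.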
2.5 Integration of Immunotherapy with Molecularly Targeted and Cytotoxic Therapies
The advent of immunotherapy has led to a dramatic shift in the
treatment and survival of several tumors, such as melanoma, renal
cell carcinoma, lung cancer, and Hodgkin lymphoma
[42–49]. Approximately one-quarter of patients with triple negative
breast cancer respond to immunotherapy [50]. Immunotherapy is
particularly successful in aggressive malignancies, where the percent-
age of tumor-initiating cells is high. For example, in melanoma the
majority of tumor cells have capacity for self-renewal [51]. These
tumors were the first where immunotherapy was shown to be suc-
cessful. Immunotherapy, informed by mathematical modeling, may
have a greater chance of leading to durable remissions [52].
Successful immunotherapy should target stem-like cells as well
as bulk tumor cells. Mathematical modeling can be helpful in pre-
dicting the variable response to immunotherapy based on different
proportions of cell types comprising a tumor. These models are
especially relevant in the adjuvant setting, where tumor growth and
invasion are driven by a small number of cells on a longer time scale,
and where considerably more time and resources are required to
directly observe survival outcomes in relation to therapy. If immunotherapy
is successful in activating the immune system to target
the stem cell compartment, it should eventually lead to eradication
of the tumor. However, the duration of therapy required
to observe an appreciable change in bulk tumor size is unknown.
Stochastic models can be used to predict extinction times of the cell
populations comprising the tumor, allowing the estimation of the
treatment duration required to eradicate cancer cells [53]. Models
should also take into account the potential costs of immunotherapy,
including autoimmune side effects. These models would allow
selection of the optimal treatment dosing and duration that
would have the best chance of tumor eradication while minimizing
the risk of side effects.
Another area in immunotherapy where mathematical modeling
may prove useful is in determining optimal combinations of thera-
pies. A branching process model has been used to predict success of
combination therapy under assumptions of mutations conferring
resistance [54]. In models combining cytotoxic chemotherapy,
vaccine therapy, CTLA4 and PD-1 inhibitors, and drugs targeting
the BRAF and MEK pathways and other molecular pathways
[55, 56], it will be important to model dosing and effectiveness in
order to address the need to minimize potentially debilitating side
effects, including autoimmune processes as well as the development
of secondary malignancies.
3 Mathematical Modeling and Simulation Tools in Translational Oncology
In silico experiments can be used in concert with cell line experiments,
animal xenograft model studies, and patient-oriented
translational studies to complement and improve cancer stem cell
research. Signaling networks in the cancer stem cell microenvironment
are complex and much work is needed to understand the
regulatory dynamics of this system. While gene knock-out experiments
allow the delineation of the importance of each individual
molecular component of the cancer stem cell microenvironment,
the combination of mathematical modeling with laboratory
research allows the study of emergent properties and provides
a framework for elucidating the integrative dynamics of this complex
system [57–60].
Given the levels of complexity of the cancer stem cell niche,
selection of the most appropriate mathematical model remains
challenging. We will describe a variety of mathematical modeling
approaches and situations where specific methods can address this
challenge by providing important biological insights.
3.1 Defining the Model
The breast cancer stem cell niche is a complex system comprised of
cancer stem cells and the surrounding cells and molecular signals
that govern the behavior of the stem cells. Multiple overlapping
feedback loops regulate whether a cancer stem cell undergoes self-
renewal, quiescence, differentiation, or apoptosis. The niche also
regulates the rare event of partially differentiated breast epithelial
cancer cells undergoing dedifferentiation into a stem-like state.
The scope of a model is defined by the reactant species involved,
and by the reactions or events that take place. Examples of species
involved in the breast cancer stem cell niche include cancer stem
cells (quiescent and invasive versus proliferative), progenitor cells,
differentiated luminal and basal cells, endothelial cells, mesenchy-
mal cells, immune cells as well as the elements of signaling path-
ways, which regulate the transitions and interactions between these
cell types [26, 61]. Those signaling pathway elements include
cytokines (e.g., IL-6, IL-8, TGF-β, BMPs), receptors (e.g., HER2
and CXCR1), and intracellular signals, including protein kinases
(e.g., Akt), transcription factor proteins (e.g., Lin28, IκB, Stat3),
microRNA precursors (e.g., let-7), and microRNAs (e.g., mir-93)
[23–29]. The reactions of a model describe the important events
that change the abundance of reactant species. Examples of reac-
tions in the breast cancer stem cell niche include stem cell self-
renewal, quiescence, differentiation, and apoptosis. In general, a
model should be kept as simple as possible, adding sufficient com-
plexity to address the biological principles involved.
Figure 2 shows a simplified model of the state transitions that
occur between the proliferative epithelial (MET) state of breast
cancer stem cells (BCSCs) and their invasive quiescent mesenchymal
(EMT) state (for illustration, a small number of species and
reactions have been included here). The species include cell types
(EMT and MET states of the BCSCs) and the factors (cytokines
and intracellular signaling molecules) that regulate transitions
between these two states. Reaction types include receptor binding
and dissociation, as well as the state transitions. A more biologically
complete model also includes the regulatory feedback loops that
exist within this system, such as IL-6 activation of the Akt/Stat/
NFκB pathway leading to increased transcription of IL-6, and the
interaction of Lin-28/Let-7 and HER2 leading to activation of
β-catenin driving self-renewal of epithelial BCSCs. The inclusion
of such regulatory feedback loops thus enables the model to more
closely simulate responses to environmentally stressful conditions.
[Figure 2: diagram. Epithelial BCSC → Mesenchymal BCSC, promoted by IL-6•gp130 and TGF-β•TGF-βR2. Mesenchymal BCSC → Epithelial BCSC, promoted by mir-93, BMPs, and HER2•EGFR.]
Species: epithelial BCSC, mesenchymal BCSC, IL-6 and its receptor (gp130), TGF-β
and its receptor (TGF-βR2), mir-93, BMP, HER2 and its receptor (EGFR).
Reactions:
Receptor binding and dissociation
IL-6 + gp130 → IL-6•gp130
IL-6•gp130 → IL-6 + gp130
TGF-β + TGF-βR2 → TGF-β•TGF-βR2
TGF-β•TGF-βR2 → TGF-β + TGF-βR2
HER2 + EGFR → HER2•EGFR
HER2•EGFR → HER2 + EGFR
Stem cell state transitions
MET + IL-6•gp130 → EMT + IL-6•gp130
MET + TGF-β•TGF-βR2 → EMT + TGF-β•TGF-βR2
EMT + HER2•EGFR → MET + HER2•EGFR
EMT + mir-93 → MET + mir-93
EMT + BMP → MET + BMP
Fig. 2 Schematic of microenvironmental signals governing BCSC state transitions. In this simplified model of
the BCSC niche, we identify the species involved, including cell types (the proliferative epithelial BCSCs and
the quiescent mesenchymal BCSC populations) and cytokines and intracellular signals that regulate transition
between these two states. The reactions included in our model directly or indirectly play a role in regulating
the BCSC state transitions
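The species and reactions of the simplified model in Fig. 2 can be written down as a small data structure before any simulation method is chosen. The following Python sketch is one hypothetical encoding; the identifiers (for example `IL6_gp130` for the IL-6 and gp130 complex), the reaction names, and the helper `net_change` are assumptions made for illustration and are not part of the original model description.

```python
from collections import namedtuple

# A reaction is a name plus lists of reactant and product species.
Reaction = namedtuple("Reaction", ["name", "reactants", "products"])

# Species of the simplified model in Fig. 2 (identifier spellings are assumed).
SPECIES = [
    "MET", "EMT",                      # epithelial and mesenchymal BCSC states
    "IL6", "gp130", "IL6_gp130",
    "TGFb", "TGFbR2", "TGFb_TGFbR2",
    "HER2", "EGFR", "HER2_EGFR",
    "mir93", "BMP",
]

REACTIONS = [
    # Receptor binding and dissociation
    Reaction("IL6_binding", ["IL6", "gp130"], ["IL6_gp130"]),
    Reaction("IL6_unbinding", ["IL6_gp130"], ["IL6", "gp130"]),
    Reaction("TGFb_binding", ["TGFb", "TGFbR2"], ["TGFb_TGFbR2"]),
    Reaction("TGFb_unbinding", ["TGFb_TGFbR2"], ["TGFb", "TGFbR2"]),
    Reaction("HER2_binding", ["HER2", "EGFR"], ["HER2_EGFR"]),
    Reaction("HER2_unbinding", ["HER2_EGFR"], ["HER2", "EGFR"]),
    # Stem cell state transitions (the signal acts catalytically)
    Reaction("EMT_via_IL6", ["MET", "IL6_gp130"], ["EMT", "IL6_gp130"]),
    Reaction("EMT_via_TGFb", ["MET", "TGFb_TGFbR2"], ["EMT", "TGFb_TGFbR2"]),
    Reaction("MET_via_HER2", ["EMT", "HER2_EGFR"], ["MET", "HER2_EGFR"]),
    Reaction("MET_via_mir93", ["EMT", "mir93"], ["MET", "mir93"]),
    Reaction("MET_via_BMP", ["EMT", "BMP"], ["MET", "BMP"]),
]


def net_change(reaction):
    """Net change in the count of each species when the reaction fires once."""
    change = {}
    for s in reaction.reactants:
        change[s] = change.get(s, 0) - 1
    for s in reaction.products:
        change[s] = change.get(s, 0) + 1
    # Species that appear on both sides (catalysts) have no net change.
    return {s: v for s, v in change.items() if v != 0}
```

Keeping the model in this declarative form makes it straightforward to add the feedback loops mentioned above as further reactions without changing the code that consumes the list.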
3.2 Deterministic Versus Stochastic Models
Deterministic models can provide insight into many important
aspects of microenvironmental signaling, including the understanding
of dynamic control (as revealed by time-course studies),
the impact of cellular cross-talk and identification of control points,
an indication of possible target points for treatment, and
the exploration of dose-response relationships [62, 63]. There are
several software packages in Matlab® and Mathematica® that enable
investigators to explore nonlinear dynamics of complex systems
based on a series of reaction rate equations. Examples of these
include the Systems Biology Toolbox [40] in Matlab®, and Reac-
tionKinetics [64] in Mathematica®.
An example of a set of reaction rate equations, applied to our simplified
model of stem cell state transitions, describes the rate of change
over time in E, the concentration of epithelial BCSCs, and
in M, the concentration of mesenchymal BCSCs:

$$\frac{dE}{dt} = -(k_1 y_1 + k_2 y_2)\,E + (k_3 y_3 + k_4 y_4 + k_5 y_5)\,M \tag{1}$$

$$\frac{dM}{dt} = -(k_3 y_3 + k_4 y_4 + k_5 y_5)\,M + (k_1 y_1 + k_2 y_2)\,E \tag{2}$$

where $y_1$ through $y_5$ are the concentrations of the microenvironmental
factors (IL-6•gp130, TGF-β•TGF-βR2, mir-93, BMP,
and HER2) that interact with the two cellular species, and $k_1$
through $k_5$ are rate constants describing the impact of the interaction
of sets of species. In this simple model, we note that the rates of
change over time for E and M are related as follows:

$$\frac{dE}{dt} = -\frac{dM}{dt}. \tag{3}$$

If symmetric self-renewal, a process that increases the
number of BCSCs, and apoptosis, which decreases the number of
BCSCs, were added to this mathematical model, the system of
equations would be:

$$\frac{dE}{dt} = -(k_1 y_1 + k_2 y_2)\,E + \beta - \delta + (k_3 y_3 + k_4 y_4 + k_5 y_5)\,M \tag{4}$$

$$\frac{dM}{dt} = -(k_3 y_3 + k_4 y_4 + k_5 y_5)\,M + (k_1 y_1 + k_2 y_2)\,E \tag{5}$$

where β and δ are the rates of symmetric self-renewal and apoptosis,
respectively. In this case, the rates of change in epithelial and mesenchymal
BCSCs would be equal in magnitude and opposite in sign only when β = δ.
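The behavior of Eqs. 4 and 5 can be explored numerically. The following Python sketch integrates them with a classical fourth-order Runge-Kutta scheme; it is not the Matlab or Mathematica tooling mentioned above. The rate constants in `K`, the factor concentrations in `Y`, and the function names are hypothetical values chosen for illustration, and β and δ are treated as constant terms, as they are written in Eq. 4.

```python
def rhs(E, M, k, y, beta=0.0, delta=0.0):
    """Right-hand sides of Eqs. 4 and 5 (Eqs. 1 and 2 when beta = delta = 0)."""
    to_mesenchymal = (k[0] * y[0] + k[1] * y[1]) * E             # E -> M flux
    to_epithelial = (k[2] * y[2] + k[3] * y[3] + k[4] * y[4]) * M  # M -> E flux
    dE = -to_mesenchymal + beta - delta + to_epithelial
    dM = -to_epithelial + to_mesenchymal
    return dE, dM


def integrate(E0, M0, k, y, beta=0.0, delta=0.0, t_end=50.0, dt=0.01):
    """Integrate from t = 0 to t_end with fixed-step fourth-order Runge-Kutta."""
    E, M = E0, M0
    for _ in range(int(round(t_end / dt))):
        s1 = rhs(E, M, k, y, beta, delta)
        s2 = rhs(E + 0.5 * dt * s1[0], M + 0.5 * dt * s1[1], k, y, beta, delta)
        s3 = rhs(E + 0.5 * dt * s2[0], M + 0.5 * dt * s2[1], k, y, beta, delta)
        s4 = rhs(E + dt * s3[0], M + dt * s3[1], k, y, beta, delta)
        E += dt * (s1[0] + 2.0 * s2[0] + 2.0 * s3[0] + s4[0]) / 6.0
        M += dt * (s1[1] + 2.0 * s2[1] + 2.0 * s3[1] + s4[1]) / 6.0
    return E, M


# Hypothetical rate constants k1..k5 and factor concentrations y1..y5.
K = [0.1, 0.4, 0.05, 0.1, 0.1]
Y = [1.0, 1.0, 1.0, 1.0, 1.0]

if __name__ == "__main__":
    print(integrate(90.0, 10.0, K, Y))                       # beta = delta
    print(integrate(90.0, 10.0, K, Y, beta=2.0, delta=1.0))  # net gain of cells
```

When β = δ the total E + M is conserved and the system relaxes to the ratio E/M = (k3 y3 + k4 y4 + k5 y5)/(k1 y1 + k2 y2). When β ≠ δ the total changes at the constant rate β − δ, so a strongly negative net rate would eventually drive E below zero, which marks the limit of validity of this simple form.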
3.3 Visualizing the Model
Petri nets are diagrams that are used in systems biology to describe
transitions and interactions that occur in complex systems [65]. In
these graphs, boxes represent transitions, ovals
represent species, and directed arcs delineate which reactant species
enter the reaction (i.e., the arrow points from the species to the reaction) and
which products are produced by the reaction (i.e., the arrow points
from the reaction to the species). Figure 3 shows the Petri net
generated by our simplified model, which describes the state transitions
between the quiescent and proliferative BCSC states. While
Petri nets are based on a strong mathematical foundation, they are
also helpful as a visual communication aid for understanding
system behavior. The Petri net graph in Fig. 3 was made using the
GraphViz package in the Julia language.
[Figure 3: Petri net diagram. Species (ovals): IL-6, gp130, IL-6*gp130, HER2, EGFR, HER2*EGFR, mir93, MET, EMT. Transitions (boxes): IL-6_Translation, IL-6_Binding, IL-6*gp130_Unbinding, HER2mRNA_Translation, HER2_Dimerization, HER2_Dissociation, EM_Trans1, ME_Trans, ME_Trans2, MET_Death.]
Fig. 3 Petri net generated by the simplified model of factors regulating transitions between proliferative and
quiescent BCSC states. The Petri net demonstrates the interconnectivity of the model, defining its reactant
species (ovals) and the transitions and events (boxes) that relate them to each other
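A Petri net of this kind can be produced from a reaction list by emitting Graphviz DOT text, in which species become ellipse nodes and transitions become box nodes. The chapter's figure was made with the GraphViz package in Julia; the Python sketch below is only an illustration of the same idea, and the function name `petri_net_dot` and the two-transition example are assumptions. In particular, reading the Fig. 3 label EM_Trans1 as the IL-6•gp130-driven MET to EMT transition is an assumption.

```python
def petri_net_dot(reactions):
    """Return Graphviz DOT text for a Petri net.

    reactions maps a transition name to a pair
    (list of input species, list of output species).
    """
    lines = ["digraph petri_net {"]
    species = sorted({s for ins, outs in reactions.values() for s in ins + outs})
    for s in species:
        lines.append(f'  "{s}" [shape=ellipse];')   # species are ovals
    for t, (ins, outs) in reactions.items():
        lines.append(f'  "{t}" [shape=box];')       # transitions are boxes
        for s in ins:
            lines.append(f'  "{s}" -> "{t}";')      # arc from reactant to transition
        for s in outs:
            lines.append(f'  "{t}" -> "{s}";')      # arc from transition to product
    lines.append("}")
    return "\n".join(lines)


# A two-transition fragment using labels that appear in Fig. 3.
EXAMPLE = {
    "IL-6_Binding": (["IL-6", "gp130"], ["IL-6*gp130"]),
    "EM_Trans1": (["MET", "IL-6*gp130"], ["EMT", "IL-6*gp130"]),
}

if __name__ == "__main__":
    print(petri_net_dot(EXAMPLE))
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` command-line tool.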
3.4 Stochastic Simulation
In certain situations, stochastic models provide additional information
when approaching scientific questions in mathematical oncology.
Rare events, such as mutation and extinction, can be
accounted for with stochastic models, as can random fluctuations
in species counts that may greatly impact the population dynamics
of the system. Using probabilistic models, one is able to calculate
how frequently a population would become extinct under a given
condition or treatment, as well as the required duration of therapy
that would be needed to eradicate a stem cell population [58].
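For the simplest stochastic model, extinction can be computed in closed form, which gives a useful reference point for simulations. The following worked equation assumes a linear birth-death process in which each cell divides at per-cell rate b and dies at per-cell rate d; these symbols, and the initial count n, are introduced here for illustration and this is not the model of [58]. For b ≠ d, the probability that a population starting from n cells is extinct by time t is

$$P_{\mathrm{ext}}(t) = \left[\frac{d\left(e^{(b-d)t} - 1\right)}{b\,e^{(b-d)t} - d}\right]^{n},$$

and the probability of eventual extinction is

$$\lim_{t \to \infty} P_{\mathrm{ext}}(t) = \min\!\left(1, \frac{d}{b}\right)^{n}.$$

A therapy that raises d above b therefore makes extinction certain in this model, and the treatment duration needed to reach a target extinction probability 1 − ε is found by solving P_ext(T) = 1 − ε for T.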
As the system gains increasing layers of complexity, more
sophisticated models are required and these models become difficult
to solve analytically. In this scenario, stochastic simulation
techniques are helpful in studying niche dynamics where there can
be large numbers of species and reactions and multiple overlapping
feedback loops. In the stochastic reaction kinetics framework, a
propensity must be specified for each reaction, as well as the net
change in count of each species. For our simplified example of the
stem cell state transitions, the species counts for epithelial BCSCs (MET),
IL-6•gp130, mesenchymal BCSCs (EMT), and HER2•EGFR would be x1
to x4, respectively, allowing the calculation of the propensity of each
reaction as well as the stoichiometric change in each species for each
reaction. Table 1 shows the propensity and stoichiometric change
for two example reactions.
Table 1
Propensity and stoichiometric change for two example reactions
Reaction                                 Propensity   Stoichiometric matrix
MET + IL-6•gp130 → EMT + IL-6•gp130      c1*x1*x2     ν11 = −1, ν12 = 0, ν13 = +1
EMT + HER2•EGFR → MET + HER2•EGFR        c2*x3*x4     ν21 = +1, ν23 = −1, ν24 = 0
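The two reactions of Table 1 translate directly into code. The sketch below uses the species ordering implied by Table 1 (x1 = MET, x2 = IL-6•gp130, x3 = EMT, x4 = HER2•EGFR); the function names, the rate constants, and the counts in the usage example are hypothetical.

```python
# Species order follows Table 1: x1 = MET, x2 = IL-6*gp130, x3 = EMT, x4 = HER2*EGFR.
SPECIES = ["MET", "IL6_gp130", "EMT", "HER2_EGFR"]

# One row per reaction, one column per species (net change per firing).
STOICH = [
    [-1, 0, +1, 0],   # MET + IL-6*gp130 -> EMT + IL-6*gp130
    [+1, 0, -1, 0],   # EMT + HER2*EGFR  -> MET + HER2*EGFR
]


def propensities(x, c):
    """Propensity of each reaction for state x and rate constants c."""
    return [
        c[0] * x[0] * x[1],   # c1 * x1 * x2
        c[1] * x[2] * x[3],   # c2 * x3 * x4
    ]


def fire(x, j):
    """Return the new state after reaction j fires once."""
    return [xi + v for xi, v in zip(x, STOICH[j])]


if __name__ == "__main__":
    state = [100, 5, 20, 3]            # hypothetical counts
    print(propensities(state, [0.01, 0.02]))
    print(fire(state, 0))
```

Because the signaling complexes act catalytically, their columns in the stoichiometric matrix are zero and only the MET and EMT counts change when a transition fires.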
Stochastic simulation algorithms proceed by updating the state
vector, which consists of particle counts for each of the reactant
species, after each reaction (or set of reactions) is allowed to fire. In
the stochastic simulation algorithm [66], the counts are updated
after every individual reaction firing. As a result, this algorithm is the most
accurate, but also the slowest. Approximate algorithms, such as
the τ-leaping algorithm, leap over a set of reaction firings; the
mean number of times a given reaction fires during the leap is
given by the product of its propensity and the length of the leap
interval [44]. While these methods increase computational speed,
they can compromise accuracy in situations where the propensity is
changing abruptly. An update to the τ-leaping algorithm, the step-anticipation
τ-leaping algorithm, allows the user to anticipate the
change in propensity during the leap and leads to improved accuracy
without compromising speed [67]. Outputs from stochastic
simulation include full distributions of cell counts, as well as trajectories
of cell counts over time. Figure 4 shows the full distributions
(panel A) and mean trajectories (panel B) for epithelial BCSCs
while varying the rate of symmetric self-renewal of epithelial
BCSCs. Full distributions may be advantageous over the mean
trajectory when one is interested in the frequency with which a
population of cancer stem cells falls below a threshold of detectabil-
ity or when investigating how frequently that population is eradi-
cated in response to therapy.
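The difference between the exact algorithm and τ-leaping can be seen on a deliberately small example. The sketch below simulates a single population of epithelial BCSCs with per-cell symmetric self-renewal rate `birth` and per-cell death rate `death`, first with the direct method of the stochastic simulation algorithm and then with basic fixed-step τ-leaping. It does not implement the step-anticipation variant of [67]. The model, all rate values, and the function names are assumptions made for illustration.

```python
import math
import random


def ssa_birth_death(n0, birth, death, t_end, rng):
    """Direct-method SSA: the state is updated after every single reaction."""
    t, n = 0.0, n0
    while n > 0:
        a_birth = birth * n            # propensity of self-renewal
        a_death = death * n            # propensity of cell death
        a_total = a_birth + a_death
        if a_total <= 0.0:
            break
        t += rng.expovariate(a_total)  # waiting time to the next reaction
        if t > t_end:
            break
        if rng.random() * a_total < a_birth:
            n += 1
        else:
            n -= 1
    return n


def poisson(mean, rng):
    """Poisson sample by Knuth's multiplication method (small means only)."""
    if mean <= 0.0:
        return 0
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def tau_leap_birth_death(n0, birth, death, t_end, tau, rng):
    """Fixed-step tau-leaping: firings per leap are Poisson(propensity * tau)."""
    n = n0
    for _ in range(int(round(t_end / tau))):
        if n <= 0:
            return 0
        births = poisson(birth * n * tau, rng)
        deaths = poisson(death * n * tau, rng)
        n = max(n + births - deaths, 0)
    return n


def extinction_fraction(simulate, runs, seed=0, **kwargs):
    """Fraction of independent runs that end with zero cells."""
    rng = random.Random(seed)
    extinct = sum(1 for _ in range(runs) if simulate(rng=rng, **kwargs) == 0)
    return extinct / runs


if __name__ == "__main__":
    print(extinction_fraction(ssa_birth_death, 2000,
                              n0=1, birth=1.0, death=0.5, t_end=8.0))
    print(extinction_fraction(tau_leap_birth_death, 500,
                              n0=1, birth=1.0, death=0.5, t_end=8.0, tau=0.01))
```

Collecting the final counts from many runs gives the full distribution rather than only the mean trajectory, which is what is needed to estimate how often a population is eradicated. The leap length `tau` trades speed against accuracy: it must be short enough that the propensities change little within one leap.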
[Figure 4: two panels. Panel A: Epithelial BCSC frequency distributions. Panel B: Epithelial BCSC trajectories.]
Fig. 4 Sample output from stochastic simulation of stem cell state transitions. The first panel shows the full
distribution of epithelial BCSC cell counts over 1000 simulations for a fixed period of time. For slower birth
rates, BCSC cell populations reach smaller final counts. In the second panel, the average trajectories of
epithelial-like BCSC populations are shown. When the birth rate is faster, BCSC cell counts initially diminish in
response to therapy but later increase over time
3.5 3D Simulation and Agent-Based Modeling
Agent-based modeling is a microscale approach that combines
elements of game theory, complex systems, emergence, and evolutionary
programming to simulate the actions and interactions of
individual cells and collective groups of cells to assess their effects
on the system. Agent-based models are particularly useful in accounting
for details at smaller scales of a system and in predicting the appearance of
complex phenomena that occur at a higher level. Open source
simulation packages have recently become available that allow sim-
ulation of the behavior of millions of cells in three-dimensional
tissues. These methods have been applied to patient-calibrated
models of ductal carcinoma in situ to predict clinical progression
[68]. While these methods currently do not distinguish stem cell
states, it would be a useful extension of this approach to predict
bulk tumor response when cancer stem cells are therapeutically
targeted.
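The core loop of an agent-based model is small even though production simulators are not. The following Python sketch is a toy two-dimensional lattice model, not the three-dimensional open source packages referred to above and not the model of [68]. Each occupied site is one cell that may die or divide into an empty neighboring site; the lattice size, the probabilities, and the function names are hypothetical.

```python
import random

NEIGHBOR_OFFSETS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def abm_step(cells, size, p_divide, p_die, rng):
    """Advance the lattice by one time step.

    cells is a set of (row, col) sites occupied by one cell each.
    The lattice is size x size with periodic boundaries.
    """
    new_cells = set(cells)
    for (r, c) in sorted(cells):           # daughters do not act until next step
        if rng.random() < p_die:
            new_cells.discard((r, c))      # cell death frees the site
            continue
        if rng.random() < p_divide:
            neighbors = [((r + dr) % size, (c + dc) % size)
                         for dr, dc in NEIGHBOR_OFFSETS]
            empty = [site for site in neighbors if site not in new_cells]
            if empty:                      # division needs free space
                new_cells.add(rng.choice(empty))
    return new_cells


def run_abm(size=20, steps=50, p_divide=0.3, p_die=0.05, seed=0):
    """Grow a colony from one central cell and return the count at each step."""
    rng = random.Random(seed)
    cells = {(size // 2, size // 2)}
    counts = [len(cells)]
    for _ in range(steps):
        cells = abm_step(cells, size, p_divide, p_die, rng)
        counts.append(len(cells))
    return counts


if __name__ == "__main__":
    print(run_abm())
```

No growth law is written into the rules, yet crowding limits division in the colony interior, so the slowing of population growth emerges from local interactions. Distinguishing stem cell states would amount to storing a state label per site and making the division and death probabilities depend on it.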
3.6 How to Integrate Models with Data
Mathematical modeling combined with experimental techniques in
single-cell expression and epigenetic analysis represents a powerful
approach to understanding the dynamics of the cancer stem cell
niche. An iterative approach is employed, where experimental data
are used to validate models and further inform mathematical modeling
parameters, and modeling predictions are used to guide
experiments and suggest new ones [69]. In situations where
known molecular mechanisms represented in the model are sufficient
to account for physiologic or cell biological phenomena, the
model can be used to explore the emergent system properties.
When there are additional phenomena not explained by molecular
mechanisms, the model could suggest new experiments to identify
additional molecular mechanisms to explain these phenomena.
We anticipate that single-cell genomic and transcriptomic
profiling will advance our understanding of the intra-tumoral heterogeneity
of cancer stem cells and of the role of circulating CSC populations
during cancer development, tumor progression, and the
response to treatment. As our understanding of cellular interactions
within the tumor and its tissue microenvironment advances, we will
be able to design novel therapies that will more effectively target the
tumor microenvironment. The ability to track the evolution of the
cancer stem cell compartments in circulating tumor cells in
response to therapy will be particularly helpful in the adjuvant
setting where eradication of cancer stem cells is most critical.
Acknowledgments
Thanks are given to Jill Granger for manuscript review and editing.
This work was supported by grants R01 CA101860 and R35
CA129765, NIH/NCATS UCLA CTSI Grant KL2TR000122,
and by the Breast Cancer Research Foundation.
References
1. Nowak M (2006) Evolutionary dynamics: exploring the equations of life. Harvard University Press, Canada
2. Michor F (2008) Mathematical models of cancer stem cells. J Clin Oncol 26:2854–2861
3. Foo J, Michor F (2014) Evolution of acquired resistance to anti-cancer therapy. J Theor Biol 355:10–20
4. Weekes SL, Barker B, Bober S, Cisneros K, Cline J, Thompson A, Hlatky L, Hahnfeldt P, Enderling H (2014) A multicompartment mathematical model of cancer stem cell-driven tumor growth dynamics. Bull Math Biol 76:762–782
5. Beerenwinkel N, Schwarz RF, Gerstung M, Markowetz F (2014) Cancer evolution: mathematical models and computational inference. Syst Biol 0:1–24
6. Gupta PB, Fillmore CM, Jiang G, Shapira SD, Tao K, Kuperwasser C, Lander ES (2011) Stochastic state transitions give rise to phenotypic equilibrium in populations of cancer cells. Cell 146:633–644
7. Sehl ME, Shimada M, Landeros A, Lange K, Wicha MS (2015) Modeling of cancer stem cell state transitions predicts therapeutic response. PLoS One 10:e0135797
8. Norton L (2005) Conceptual and practical implications of breast tissue geometry: toward a more effective, less toxic therapy. Oncologist 10:370–381
9. Baldock AL, Rockne RC, Boone AD, Neal ML, Hawkins-Daarud A, Corwin DM, Bridge CA, Guyman LA, Trister AD, Mrugala MM, Rockhill JK, Swanson KR (2013) From patient-specific mathematical neuro-oncology to precision medicine. Front Oncol 3:62
10. Withers HR, Taylor JMG, Maciejewski B (1988) Treatment volume and tissue tolerance. Int J Radiat Oncol Biol Phys 14:751–759
11. Simon R, Altman DG (1994) Statistical aspects of prognostic factor studies in oncology. Br J Cancer 69:979–985
12. Almendro V, Cheng Y-K, Randles A, Itzkovitz S, Marusyk A, Ametller E, Gonzalez-Farre X, Munoz M, Russnes HG, Helland A, Rye IH, Borresen-Dale AL, Maruyama R, van Oudenaarden A, Dowsett M, Jones RL, Reis-Filho J, Gascon P, Goenen M, Michor F, Polyak K (2014) Inference of tumor evolution during chemotherapy by computational modeling and in situ analysis of genetic and phenotypic cellular diversity. Cell Rep 6:514–527
13. Trinh A, Rye IH, Almendro V, Helland A, Russnes HG, Markowetz F (2014) GoIFISH: a system for the quantification of single cell heterogeneity from IFISH images. Genome Biol 15:442
14. Hou Y, Song L, Zhu P, Zhang B, Tao Y, Xu X, Li F, Wu K, Liang J, Shao D, Wu H, Ye X, Ye C, Wu R, Jian M, Chen Y, Xie W, Zhang R, Chen L, Liu X, Yao X, Zheng H, Yu C, Li Q, Gong Z, Mao M, Yang X, Yang L, Li J, Wang W, Lu Z, Gu N, Laurie G, Bolund L, Kristiansen K, Wang J, Yang X, Wang J (2012) Single-cell exome sequencing and monoclonal evolution of a JAK2-negative myeloproliferative neoplasm. Cell 148:873–885
15. Kim KI, Simon R (2014) Using single-cell sequencing data to model the evolutionary history of a tumor. BMC Bioinformatics 15:27
16. Azizi E, Fouladdel S, Deol YS, Bender J, McDermott S, Jiang H, Sehl M, Clouthier SG, Nagrath S, Wicha MS. Exploring cancer stem cells heterogeneity via single cell multiplex gene expression analysis. Abstract 1943. Proceedings: AACR 106th Annual Meeting 2015; April 5–9th, 2014; San Diego, CA.
17. Azizi E, Jiagge EM, Fouladdel S, Wong S, Dziubinski ML, Sehl M, Kyani A, Li J, Jiang H, Luther TK, Clouthier SG, McDermott SP, Carpten J, Newman LA, Merajver SD, Wicha M. Single cell multiplex gene expression analysis to unravel heterogeneity of PDX samples established from tumors of breast cancer patients with different ethnicity. Abstract 4834. Proceedings: AACR 106th Annual Meeting 2015; April 18–22, 2015; Philadelphia, PA.
18. Hwang D, Smith JJ, Leslie DM, Weston AD, Rust AG, Ramsey S, de Atauri P, Siegel AF, Bolouri H, Aitchison JD, Hood L (2005) A data integration methodology for systems biology: experimental verification. Proc Natl Acad Sci U S A 102:17302–17307
19. Yeang CH, Ideker T, Jaakkola T (2004) Physical Network Models. J Comput Biol 11:243–262
20. Markowetz F, Spang R (2007) Inferring cellular networks – a review. BMC Bioinformatics 8(Suppl 6):S5
21. Gatenby RA, Vincent TL (2003) An evolutionary model of carcinogenesis. Cancer Res 63:6212–6220
22. Axelrod R, Axelrod DE, Pienta KJ (2006) Evolution of cooperation among tumor cells. Proc Natl Acad Sci U S A 103:13474–13479
23. Korkaya H, Kim GI, Davis A, Malik F, Henry NL, Ithimakin S, Quraishi AA, Tawakkol N, D’Angelo R, Paulson AK, Chung S, Luther T, Paholak HJ, Liu S, Hassan KA, Zen Q, Clouthier SG, Wicha MS (2012) Activation of an IL6 inflammatory loop mediates trastuzumab resistance in HER2+ breast cancer by expanding the cancer stem cell population. Mol Cell 47:570–584
24. Korkaya H, Liu S, Wicha MS (2011) Breast cancer stem cells, cytokine networks, and the tumor microenvironment. J Clin Invest 121:3804–3809
25. Korkaya H, Liu S, Wicha MS (2011) Regulation of cancer stem cells by cytokine networks: attacking cancer’s inflammatory roots. Clin Cancer Res 17:6125–6129
26. Liu S, Ginestier C, SJ O, Clouthier SG, Patel SH, Monville F, Korkaya H, Heath A, Dutcher J, Kleer CG, Jung Y, Dontu G, Taichman R, Wicha MS (2011) Breast cancer stem cells are regulated by mesenchymal stem cells through cytokine networks. Cancer Res 71:614–624
27. Liu S, Clouthier SG, Wicha MS (2012) Role of microRNAs in the regulation of breast cancer stem cells. J Mammary Gland Biol Neoplasia 17:15–21
28. Deng L, Shang L, Bai S, Chen J, He X, Martin-Trevino R, Chen S, Li XY, Meng X, Yu B, Wang X, Liu Y, McDermott SP, Ariazi AE, Ginestier C, Ibarra I, Ke J, Luther T, Clouthier SG, Xu L, Shan G, Song E, Yao H, Hannon GJ, Weiss SJ, Wicha MS, Liu S (2014) MicroRNA100 inhibits self-renewal of breast cancer stem-like cells and breast tumor development. Cancer Res 74:6648–6660
29. Liu S, Patel SH, Ginestier C, Ibarra I, Martin-Trevino R, Bai S, McDermott SP, Shang L, Ke J, SJ O, Heath A, Zhang KJ, Korkaya H, Clouthier SG, Charafe-Jauffret E, Birnbaum D, Hannon GJ, Wicha MS (2012) MicroRNA93 regulates proliferation and differentiation of normal and malignant breast stem cells. PLoS Genet 8:e1002751
30. Liu S, Cong Y, Wang D, Sun Y, Deng L, Liu Y, Martin-Trevino R, Shang L, McDermott SP, Landis MD, Hog S, Adams A, D’Angelo R, Ginestier C, Charafe-Jauffret E, Clouthier SG, Birnbaum D, Wong ST, Zhan M, Chang JC, Wicha MS (2013) Breast cancer stem cell transition between epithelial and mesenchymal states reflective of their normal counterparts. Stem Cell Rep 2:78–91
31. Morrison SJ, Kimble J (2006) Asymmetric and symmetric stem-cell divisions in development and cancer. Nature 441:1068–1074
32. Cicalese A, Bonizzi G, Pasi CE, Faretta M, Ronzoni S, Giulini B, Brisken C, Minucci S, Di Fiore PP, Pelicci PG (2009) The tumor suppressor p53 regulates polarity of self-renewing divisions in mammary stem cells. Cell 138:1083–1095
33. Peng D, Tanikawa T, Li W, Zhao L, Vatan L, Szeliga W, Wan S, Wei S, Wang Y, Liu Y, Staroslawska E, Szubstarski F, Rolinski J, Grywalska E, Stanisławek A, Polkowski W, Kurylcio A, Kleer C, Chang AE, Wicha M, Sabel M, Zou W, Kryczek I (2016) Myeloid-derived suppressor cells endow stem-like qualities to breast cancer cells through IL6/STAT3 and NO/NOTCH cross-talk signaling. Cancer Res 76:3156–3165
34. Rompolas P, Mesa KR, Greco V (2013) Spatial organization within a niche as a determinant of stem cell fate. Nature 402:513–518
35. Jones DL, Wagers AJ (2008) No place like home: anatomy and function of the stem cell niche. Nat Rev Mol Cell Biol 9:11–21
36. Ovadia J, Nie Q (2013) Stem cell niche structure as an inherent cause of undulating epithelial morphologies. Biophys J 104:237–246
37. Szekely T, Burrage K, Mangel M, Bonasall MB (2014) Stochastic dynamics of interacting haematopoietic stem cell niche lineages. PLoS Comput Biol 10:e1003794
38. Komarova NL (2006) Spatial stochastic models for cancer initiation and progression. Bull Math Biol 68:1573–1599
39. Komarova NL (2007) Loss- and gain-of-function mutations in cancer: mass-action, spatial and hierarchical models. J Stat Phys 128:413–446
40. Conley SJ, Gheordunescu E, Kakarala P, Newman B, Korkaya H, Heath AN, Clouthier SG, Wicha MS (2012) Antiangiogenic agents increase breast cancer stem cells via the generation of tumor hypoxia. Proc Natl Acad Sci U S A 109:1784–1789
41. Savage VM, Herman AB, West GB, Leu K (2013) Using fractal geometry and universal growth curves as diagnostics for comparing tumor vasculature and metabolic rate with healthy tissue and for predicting responses to drug therapies. Discr Cont Dyn Syst Ser B 18:1077–1108
42. Pardoll DM (2012) The blockade of immune checkpoints in cancer immunotherapy. Nat Rev Cancer 12:252–264
43. Hodi FS, O’Day SJ, McDermott DF, Weber RW, Sosman JA, Haanen JB, Gonzalez R, Robert C, Schadendorf D, Hassel JC, Akerley W, van den Eertwegh AJ, Lutzky J, Lorigan P, Vaubel JM, Linette GP, Hogg D, Ottensmeier CH,
Lebbé C, Peschel C, Quirt I, Clark JI, Wolchok MPDL3280A leads to clinical activity in
JD, Weber JS, Tian J, Yellin MJ, Nichol GM, patients with metastatic triple-negative breast
Hoos A, Urba WJ (2010) Improved survival cancer (TNBC). Proceedings: AACR 106th
with ipilimumab in patients with metastatic Annual Meeting 2015; April 18–22, 2015;
melanoma. N Engl J Med 363:711–723 Philadelphia, PA
44. Mellman I, Coukos G, Dranoff G (2011) Can- 51. Quintana E, Shackleton M, Sabel MS, Fullen
cer immunotherapy comes of age. Nature DR, Johnson TM, Morrison SJ (2008) Effi-
480:480–489 cient tumour formation by single human mela-
45. Luke JJ, Flaherty KT, Ribas A, Long noma cells. Nature 456:593–598
GV. Targeted agents and immunotherapies: 52. Walker R, Enderling H (2015) From concept
optimizing outcomes in melanoma. Nat Rev to clinic: mathematically informed immuno-
Clin Oncol. 2017 14 463 therapy. Curr Probl Cancer 40:68–83
46. Ribas A, Hamid O, Daud A, Hodi FS, Wolchok 53. Sehl M, Zhou H, Sinsheimer JS, Lange KL
JD, Kefford R, Joshua AM, Patnaik A, Hwu (2011) Extinction models for cancer stem cell
WJ, Weber JS, Gangadhar TC, Hersey P, therapy. Math Biosci 234(2):132–146
Dronca R, Joseph RW, Zarour H, 54. Robert L, Ribas A, Hu-Lieskovan S (2016)
Chmielowski B, Lawrence DP, Algazi A, Rizvi Combining targeted therapy with immuno-
NA, Hoffner B, Mateus C, Gergich K, Lindia therapy. Can 1+1 equal more than 2? Semin
JA, Giannotti M, Li XN, Ebbinghaus S, Kang Immunol 28:73–80
SP, Robert C (2016) Association of Pembroli- 55. Hu-Lieskovan S, Robert L, Homet Moreno B,
zumab With Tumor Response and Survival Ribas A (2014) Combining targeted therapy
Among Patients With Advanced Melanoma. with immunotherapy in BRAF-mutant mela-
JAMA 315:1600–1609 noma: promise and challenges. J Clin Oncol
47. Garon EB, Rizvi NA, Hui R, Leighl N, Balma- 32:2248–2254
noukian AS, Eder JP, Patnaik A, Aggarwal C, 56. Lu H, Clauser KR, Tam WL, Fröse J, Ye X,
Gubens M, Horn L, Carcereny E, Ahn MJ, Eaton EN, Reinhardt F, Donnenberg VS,
Felip E, Lee JS, Hellmann MD, Hamid O, Bhargava R, Carr SA, Weinberg RAA (2014)
Goldman JW, Soria JC, Dolled-Filhart M, breast cancer stem cell niche supported by jux-
Rutledge RZ, Zhang J, Lunceford JK, tacrine signalling from monocytes and macro-
Rangwala R, Lubiniecki GM, Roach C, phages. Nat Cell Biol 16:1105–1117
Emancipator K, Gandhi L (2015)
KEYNOTE-001 Investigators. Pembrolizu- 57. Bozic I, Reiter JG, Allen B, Antal T,
mab for the treatment of non-small-cell lung Chatterjee K, Shah P, Moon YS, Yaqubie A,
cancer. N Engl J Med 372:2018–2028 Kelly N, Le DT, Lipson EJ, Chapman PB,
Diaz LA Jr, Vogelstein B, Nowak MA (2013)
48. Ansell SM, Lesokhin AM, Borrello I, Evolutionary dynamics of cancer in response to
Halwani A, Scott EC, Gutierrez M, Schuster targeted combination therapy. Elife 2:e00747
SJ, Millenson MM, Cattry D, Freeman GJ,
Rodig SJ, Chapuy B, Ligon AH, Zhu L, Grosso 58. Sehl ME, Sinsheimer JS, Zhou H, Lange KL
JF, Kim SY, Timmerman JM, Shipp MA, (2009) Differential destruction of stem cells:
Armand P (2015) PD-1 blockade with nivolu- implications for targeted cancer stem cell ther-
mab in relapsed or refractory Hodgkin’s lym- apy. Cancer Res 69(24):9481–9489
phoma. N Engl J Med 372:311–319 59. Rodriguez-Brenes IA, Komarova NL, Wodarz
49. Motzer RJ, Escudier B, McDermott DF, D (2011) Evolutionary dynamics of feedback
George S, Hammers HJ, Srinivas S, Tykodi escape and the development of stem-cell-
SS, Sosman JA, Procopio G, Plimack ER, driven cancers. Proc Natl Acad Sci U S A
Castellano D, Choueiri TK, Gurney H, 108:18983–18988
Donskov F, Bono P, Wagstaff J, Gauler TC, 60. Behar M, Barken D, Werner SL, Hoffmann A
Ueda T, Tomita Y, Schutz FA, (2013) The dynamics of signaling as a pharma-
Kollmannsberger C, Larkin J, Ravaud A, cological target. Cell 155:448–461
Simon JS, LA X, Waxman IM, Sharma P 61. Sun Z, Komarova NL (2012) Stochastic mod-
(2015) CheckMate 025 Investigators. Nivolu- eling of stem-cell dynamics with control. Math
mab versus everolimus in advanced renal-cell Biosci 240:231–240
carcinoma. N Engl J Med 373:1803–1813 62. Mitchell S, Tsui R, Hoffmann A (2015) Study-
50. Leisha A. Emens, Fadi S. Braiteh, Philippe Cas- ing NF-kB signaling with mathematical mod-
sier, Jean-Pierre Delord, Joseph Paul Eder, els. Methods Mol Biol 1280:647–661
Marcella Fasso, Yuanyuan Xiao, Yan Wang, 63. Schmidt H, Jirstrand M (2006) Systems Biol-
Luciana Molinero, Daniel S. Chen and Ian ogy Toolbox for MATLAB: a computational
Krop. Abstract 2859: Inhibition of PD-L1 by
platform for research in systems biology. Bioin- 67. Gillespie DT, Petzold LR (2003) Improved
formatics 22:514–515 leap-size selection for accelerated stochastic
64. Nagy AL, Papp D, Toth J (2012) ReactionKi- simulation. J Chem Phys 119:8229–8234
netics-- a mathematica package with applica- 68. Macklin P, Edgerton ME, Thompson AM,
tions. Chem Eng Sci 83:12–23 Cristini V (2012) Patient-calibrated agent-
65. Peterson JL (1981) Petri net theory and the based modeling of ductal carcinoma in situ
modeling of systems. Prentice-Hall, Engle- (DCIS): from microscopic measurements to
wood Cliffs, NJ macroscopic predictions of clinical progression.
66. Gillespie DT (1977) Exact stochastic simula- J Theor Biol 301:122–140
tion of coupled chemical reactions. J Phys 69. Enderling H (2013) Unveiling stem cell kinet-
Chem 81:2340–2361 ics: prime time for integrating experimental
and computational models. Front Oncol 3:291
Chapter 17
Methods for High-throughput Drug Combination Screening and Synergy Scoring
Liye He, Evgeny Kulesskiy, Jani Saarela, Laura Turunen,
Krister Wennerberg, Tero Aittokallio, and Jing Tang
Abstract
Gene products or pathways that are aberrantly activated in cancer but not in normal tissue hold great
promise as effective and safe anticancer therapeutic targets. Many targeted drugs have entered
clinical trials but have so far shown limited efficacy, mostly due to variability in treatment responses and often
rapidly emerging resistance. Toward more effective treatment options, we will need multi-targeted drugs or
drug combinations, which selectively inhibit the viability and growth of cancer cells and block distinct
escape mechanisms for the cells to become resistant. Functional profiling of drug combinations requires
careful experimental design and robust data analysis approaches. At the Institute for Molecular Medicine
Finland (FIMM), we have developed an experimental-computational pipeline for high-throughput screen-
ing of drug combination effects in cancer cells. The integration of automated screening techniques with
advanced synergy scoring tools allows for efficient and reliable detection of synergistic drug interactions
within a specific window of concentrations, hence accelerating the identification of potential drug combi-
nations for further confirmatory studies.
Key words Drug combinations, High-throughput screening, Experimental design, Synergy scoring,
Computational modeling
1 Introduction
A pressing challenge in the development of personalized cancer
medicine is to understand how to make the most out of genomic
information from a patient when evaluating treatment options.
Over the past decade, there has been an extensive effort to
sequence cancer genomes in large patient cohorts, sparking expec-
tations to identify novel targets for more effective and selective
treatment opportunities. These sequencing efforts have revealed a
remarkable degree of genetic heterogeneity between and within
tumors, which partly explains why the traditional “one-size-fits-all”
anticancer treatment strategies have often produced disappointing
outcomes in clinical trials [1]. On the other hand,
functional studies using high-throughput drug screening have allowed
cancer genomic vulnerabilities to be linked to targeted drug responses
[2–4]. However, complex genetic and epigenetic changes may lead
to re-activation of multiple compensatory pathways and to emer-
gence of treatment-resistant subpopulations (so-called cancer
clonal evolution).
Therefore, to reach effective and sustained clinical responses,
one often needs multi-targeted drugs or drug combinations, which
selectively inhibit multiple pathways in cancer cells [5, 6]. To facili-
tate discovery of effective drug combinations, preclinical studies
often rely on drug combination screening in cancer cell models.
Those serve as a starting point to prioritize the most promising hits
for further experimental investigation and therapy optimization.
Many of the existing drug combination studies, however, focus
on conventional chemotherapeutic drugs tested in a panel of cell
lines, for which the drug combination effects might not easily
translate into treatment options in the clinic [7]. In contrast,
primary cell cultures that are derived from patients have shown
tremendous potential that could enable the rapid assessment of
novel drugs or drug combinations at the individual level [8]. To
facilitate clinical translation, we have established at FIMM an Indi-
vidualized Systems Medicine (ISM) drug combination platform.
The ISM platform combines genomics, drug testing, and compu-
tational tools to predict drug responses for individual cancer
patients. The ISM platform has successfully been used to function-
ally profile primary leukemia, ovarian cancer, and prostate cancer
patient samples ex vivo so that the drug responses can be translated
to the in vivo setting [9–12].
The advances in high-throughput drug combination screening
have enabled the assaying of a large collection of chemical com-
pounds, generating dynamic dose-response profiles that allow us to
quantify the effect of drug combinations at an unprecedented level.
A drug combination is usually classified as synergistic, antagonistic,
or non-interactive. This classification is based on the deviation of
the observed drug combination response from the expected effect
of non-interaction (the null hypothesis). To quantify the degree of
drug synergy, several models have been proposed, such as those
based on the Highest single agent model (HSA) [13], the Loewe
additivity model (Loewe) [14], and the Bliss independence model
(Bliss) [15]. These existing drug synergy scoring models, together
with their software implementations, were initially proposed for
low-throughput experiments. In those experiments a limited num-
ber of drugs were combined with a fixed level of response, e.g., at
their IC50 concentrations. For example, CompuSyn has become a
popular tool to calculate a combination index (CI) using the Loewe
additivity model [16]. However, CompuSyn allows only for manual
input of one drug combination at a time, which makes it less
efficient for analyzing multiple drug combinations, particularly
when the drug combinations are tested under various concentrations,
in a so-called dose-response matrix design.
To facilitate the data analysis of high-throughput drug combi-
nation screens, more recent tools have been made available as R
implementations (https://www.R-project.org). For example, mixlow
is an R package which utilizes a nonlinear mixed-effects model
to calculate the CI [17]. However, mixlow works only for an
experimental design where the ratio of two drugs in a combination
is fixed over all tested concentrations. Therefore, it may not be
directly applicable for a dose-response matrix design, where the
ratios of two drugs vary. Another R package, called drc, provides
an URSA (universal response surface approach) model, which is
more suitable for dose-response matrix data [18]. URSA extends
the Loewe model by considering the response surfaces over all the
tested concentrations. In contrast to the CI, which is defined at a
fixed response level, the URSA model provides a summarized drug
interaction score from the whole dose-response matrix. However,
the URSA implementation in the drc package often leads to fitting
errors when the dose responses fail to comply with the model
assumptions. To evaluate the appropriateness of URSA, one needs
to trace back to its underlying theoretical paper [19]. The Bliss
model has also been extended recently by incorporating the
response surface concept, similar to the URSA model, so that
a contour plot of a Bliss interaction index can be constructed
[20]. We have recently developed a response surface model, called
Zero Interaction Potency (ZIP), which combines the Loewe and
the Bliss models, and proposed a delta score to characterize the
synergy landscape over the full dose-response matrix [21].
Here, we describe an experimental-computational drug com-
bination analysis pipeline that has been widely used in Finland and
elsewhere to test and score effects of drug combinations in cancer
cells [22–24]. The pipeline includes both an experimental protocol
for dose-response matrix drug combination assays, as well as
computational tools to facilitate the plate design and synergy mod-
eling. The pipeline is applicable not only to cancer cell lines but also
to patient-derived cancer samples for individualized drug combina-
tion optimization. With the increasing size of our compound
library, including compounds that target all the known cancer
survival pathways, the drug combination discovery can now be
targeted toward more personalized anticancer treatment. We first
describe the experimental protocol including a computer program,
called FIMMcherry, which enables efficient production and visuali-
zation of combination assay plates, the output of which can be
directly exported to the robotic system for automated dispensing.
To address the lack of tailored software tools for high-throughput
drug combination scoring, we here report a new R-package, SynergyFinder,
which provides efficient implementations for all the popular
synergy scoring models, including HSA, Loewe, Bliss, and ZIP.
This implementation provides the lab users with more flexibility to
explore their drug combination data. We expect that the use of
SynergyFinder will greatly improve the interpretation of the drug
combination results and may eventually lead to the standardization
of preclinical drug combination studies.
2 Materials
2.1 Cell Culture
1. Established cancer cell lines can be purchased from multiple
vendors (see Note 1).
2. Patient-derived samples are obtained with permission from
Finnish biobanks, hospitals, and clinical collaborators [2].
3. Cell media, serum and supplements recommended by cell line
providers.
4. Trypsin-EDTA.
5. HyQTase.
6. CellTox Green Cytotoxicity reagent (Promega).
7. CellTiter-Glo or CellTiter-Glo 2.0 reagent (Promega).
8. 384-well tissue culture treated sterile assay plates.
9. MicroClime Environmental Lids.
10. Beckman Coulter Biomek FXP for dispensing primary cells,
which tend to grow as aggregates.
11. Plate reader.
2.2 Drug Combination Plate Design
1. FIMMcherry software (see Note 2).
2. Source plate file in text format.
3. Drug combination file in text format.
4. Compound library (see Note 3, Fig. 1).
5. Labcyte Echo 550 acoustic dispenser for dispensing compounds
in precise volumes with high accuracy (2.5 nL).
6. Storage pods.
2.3 Phenotypic Readouts
1. CellTox Green Cytotoxicity Assay.
2. CellTiter-Glo or CellTiter-Glo 2.0 Assay.
3. MultiFlo FX Multi-Mode Dispenser with RAD module or Multidrop
Combi Reagent Dispenser for dispensing growth media,
CellTiter-Glo reagents and seeding cells.
4. Plate shaker.
5. PHERAstar FS or Cytation 5 Cell Imaging Multi-Mode plate
readers for CellTox Green (fluorescence) and CellTiter-Glo
(luminescence) detection on 384-well plates.
[Figure 1. Panel A, compound class: conventional chemotherapy (n=74), kinase inhibitor (n=262),
rapalog (n=5), immunomodulatory (n=14), differentiating/epigenetic modifier (n=61),
hormone therapy (n=22), apoptotic modulator (n=22), metabolic modifier (n=17),
kinesin inhibitor (n=3), nonsteroidal anti-inflammatory drug (n=2), HSP inhibitor (n=9),
other (n=34). Panel B, clinical stage: approved (n=156), investigational (n=279), probe (n=90).]
Fig. 1 An overview of the FIMM oncology compound collection. The drug combination platform enables the
testing of pairwise drug combinations from 525 small-molecular anticancer compounds that cover mainly
kinase inhibitors and other signal transduction modulators. About half of the compounds comprised in the
library are either FDA-approved or being evaluated in clinical trials at different stages
2.4 Software Tools for Data Analysis
1. R.
2. Bioconductor.
3. SynergyFinder package (see Notes 4–6).
4. csv file that describes a drug combination dataset.
3 Methods
The drug combination analysis pipeline starts from sample preparation
and compound selection, based on which an automated plate
design program called FIMMCherry is utilized. Drug sensitivity
and resistance are then profiled in the plate by cell viability,
cytotoxicity, and other readouts. The resulting dose-response
matrix data is analyzed with the SynergyFinder R package for the
detection of synergistic drug combinations (see Fig. 2).
3.1 Cell Culture
1. Dissociate cells by adding 0.05% trypsin-EDTA or HyQTase to
achieve a single-cell suspension.
2. Titrate cells to define optimal density within exponential growth
(log phase). Seed cells in twofold serial dilution starting from
16,000 cells/well on 384-well plates. For most cell lines, the
optimal cell number is in the range of 500–2000 cells/well.
[Figure 2. Panel A: dose-response matrix with DrugA dose (nM) on one axis (0, 1, 3, 10, 30, 100, 300, 1000)
and DrugB dose (nM) on the other (0, 10, 30, 100, 300, 1000, 3000, 10000). Panel B: synergy score
surfaces (scale −40 to 40) for non-interactive, antagonistic, and synergistic combinations.]
Fig. 2 An overview of the drug combination data analysis. (a) A typical high-throughput drug combination
screen utilizes a dose-response matrix design where all possible dose combinations for a drug pair can be
tested. Colors in the dose-response matrices show different levels of phenotypic responses of the cancer cell
with red indicating stronger inhibition and green indicating lower inhibition. (b) Depending on the interaction
pattern models derived from the dose-response matrices, a drug combination can be classified as
non-interactive, antagonistic, or synergistic
3. Detect cell toxicity and viability after 72 h of incubation
using CellTox Green and CellTiter-Glo reagents. Add 5 μL of
culture medium to pre-drugged 384-well assay plates using a
MultiFlo FX Multi-Mode Dispenser with RAD module or a Multidrop
Combi Reagent Dispenser and shake the plates for
15 min. If toxicity measurement is performed, include a 1:2000
dilution of CellTox Green reagent. Seed cells at optimal density
into the pre-drugged assay plates using a MultiFlo FX Multi-Mode
Dispenser with RAD module or a Multidrop Combi Reagent
Dispenser in 20 μL of culture medium. Culture cells for 72 h
at 37 °C in the presence of 5% CO2. Measure the amount of
dead cells, stained by the CellTox Green reagent, using a plate
reader in fluorescence mode. For viability measurement, add
25 μL of CellTiter-Glo reagent to the assay plates using a MultiFlo FX
Multi-Mode Dispenser with RAD module or a Multidrop Combi
Reagent Dispenser. Shake the plates for 5 min and subsequently
spin the plates at 218 × g for 5 min. Measure the CellTiter-Glo (CTG) signal in
the assay wells using a plate reader in luminescence mode.
4. MicroClime Environmental Lids are used to minimize edge
effects and to keep concentrations of solutions constant.
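The cell titration in step 2 can be laid out numerically. The following is a minimal R sketch that lists the seeding densities of a twofold dilution series starting from 16,000 cells/well. The number of dilution steps (eight here) is an assumption made for illustration, since the protocol does not fix it.

```r
# Twofold serial dilution of seeding density, starting from 16,000 cells/well.
# n_steps is an illustrative assumption; the protocol does not specify it.
start_density <- 16000
n_steps <- 8
densities <- start_density / 2^(0:(n_steps - 1))
print(densities)

# Densities that fall within the range reported as optimal for most cell lines
in_range <- densities[densities >= 500 & densities <= 2000]
print(in_range)
```

With these assumed settings, three points of the series (2000, 1000, and 500 cells/well) fall inside the 500–2000 cells/well window, so the titration resolves that window at twofold steps.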
3.2 Drug Combination Plate Design
We utilize a combination plate layout where six compound pairs can
be accommodated on one 384-well plate. A given pair of drugs is
combined in a series of one blank and seven half-log dilution
concentrations, resulting in an 8 × 8 dose matrix. To be able to
transfer the compounds according to this matrix format, a pick list
defining the source and destination plate locations and transfer
volumes for the compounds is needed. An in-house program, called
FIMMCherry, has been developed to automatically generate these
rather complex pick lists effortlessly (see Note 7).
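The 8 × 8 layout described above can be sketched in a few lines of R. This is only an illustration of the dose grid, not the FIMMCherry implementation; the top concentration (`top_conc_nM`) is a hypothetical value, and the column names simply mirror those used later for the SynergyFinder input.

```r
# One blank plus seven half-log dilutions per drug, giving an 8 x 8 dose matrix.
# top_conc_nM is a hypothetical top concentration chosen for illustration.
top_conc_nM <- 10000
half_log_series <- top_conc_nM / 10^(0.5 * (0:6))
doses <- c(0, rev(half_log_series))          # blank first, then ascending doses
print(signif(doses, 3))

# All pairwise dose combinations for one drug pair
dose_matrix <- expand.grid(ConcRow = doses, ConcCol = doses)
wells_per_pair <- nrow(dose_matrix)           # 8 * 8 = 64 wells
pairs_per_plate <- 384 / wells_per_pair       # 384 / 64 = 6 drug pairs
print(c(wells_per_pair = wells_per_pair, pairs_per_plate = pairs_per_plate))
```

Each half-log step divides the concentration by roughly 3.16, which is why dose axes of this kind are often labeled with rounded values such as 10, 30, 100, 300.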
Two tab-delimited text files are needed as input:
1. A source plate file provides information of the compound stocks
(compound identification, available concentration ranges,
source plate identification, and well identification).
2. A drug combination file containing the selected compound
pairs.
After loading the input files, FIMMCherry will show the layout
of the plates accordingly (Fig. 3). A pick list that is compatible with
the Labcyte Echo dispenser is then created by the program for
compound dispensing. The Labcyte Echo 550 acoustic dispenser
transfers liquid from source wells to destination wells in a
non-contact fashion in 2.5 nL droplets. The pick list generated
above is compatible with the Echo Cherry Pick software without
further modifications to produce the pre-drugged assay plates [10].
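Because the dispenser transfers liquid in 2.5 nL droplets, any requested concentration is realized as a whole number of droplets. The sketch below shows this arithmetic in R under stated assumptions: the stock concentration is hypothetical, the 25 μL final volume is taken from the assay steps (5 μL medium plus 20 μL cell suspension), the small volume of the droplets themselves is neglected, and the function name is invented for illustration. It does not reproduce the actual FIMMCherry or Echo pick list format.

```r
# Convert a target final concentration into a transfer volume made of 2.5 nL
# droplets. Illustrative only: this is not FIMMCherry's actual algorithm, and
# the stock concentration below is a hypothetical value.
droplet_nL <- 2.5
assay_volume_nL <- 25 * 1000     # 25 uL final assay volume (5 uL + 20 uL)

transfer_volume_nL <- function(target_nM, stock_nM) {
  ideal <- target_nM * assay_volume_nL / stock_nM   # C1 * V1 = C2 * V2
  droplets <- round(ideal / droplet_nL)             # whole droplets only
  droplets * droplet_nL
}

stock_nM <- 10e6                 # hypothetical 10 mM stock
vol <- transfer_volume_nL(target_nM = 1000, stock_nM = stock_nM)
achieved_nM <- vol * stock_nM / assay_volume_nL
print(c(volume_nL = vol, achieved_nM = achieved_nM))
```

Under these assumptions, a 1000 nM target corresponds to a single 2.5 nL droplet, so lower concentrations would require a more dilute stock. This is one reason a source plate file listing the available concentration ranges is needed.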
1. The compounds are dissolved in DMSO except for 19 drugs
(e.g., platinum drugs) with poor DMSO solubility or stability
that are instead dissolved in water. All 525 compounds are
transferred in five doses on eight 384-well plates.
2. The pre-dosed plates are stored in Storage Pods under nitrogen
gas at room temperature for up to 1 month.
3. For quality control, a regular quality check-up of our compound
library is performed which includes the testing of the com-
pounds with four assay-ready cell lines (DU4475, HDQ-P1,
IGROV-1, and MOLM-13) every 2 months. Following the
time-dependent reproducibility of the drug responses allows us
to precisely detect any changes in the compound stability and
activity.
Fig. 3 Drug combination plate design using FIMMCherry. The graphical user interface contains a virtual plate
enabling an interactive way of designing the plate. After loading the input files including the source, the
control, and drug pair information (the black inset boxes), the selected drug combinations and their dose
ranges will be listed in the “Drug Pair” tab, for which an echo file will be generated for acoustic dispensing.
Each plate can be visualized in a separate tab and will be named by its plate identifier (the red inset box). The
“Info” tab shows the liquid consumption in the source plates (the yellow inset box)
3.3 Viability Readouts
1. Transfer 5 μL of media with CellTox Green Cytotoxicity reagent
into a 384-well plate containing the pre-diluted compound library (see
Note 8).
2. Shake the plate on the plate shaker at 450 rpm for 5 min to
dissolve the drugs properly.
3. Transfer a single-cell suspension in 20 μL of media to a 384-well
plate. Final dilution of CellTox Green reagent should be 1:2000
in 25 μL.
4. Incubate the cells in the plates for 72 h.
5. Shake the plates on the plate shaker at 500 rpm for 30 s. Read
fluorescence in the plates using a plate reader for CellTox Green
Cytotoxicity detection.
6. Transfer 25 μL of CellTiter-Glo reagent to the plate.
7. Shake the plates on the plate shaker at 450 rpm for 5 min and
spin the plate at 218 × g for 5 min.
8. Read luminescence in the plates for detecting cell viability using
a plate reader.
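The dilution arithmetic behind steps 1 and 3 can be checked with a short R calculation. It assumes that the CellTox Green reagent is present only in the 5 μL of medium added in step 1 and not in the 20 μL cell suspension; the variable names are illustrative.

```r
# Dilution of CellTox Green needed in the 5 uL pre-dispensed medium so that the
# final dilution is 1:2000 after adding 20 uL of cell suspension (25 uL total).
reagent_volume_uL <- 5
final_volume_uL <- 25
final_dilution <- 2000
working_dilution <- final_dilution * reagent_volume_uL / final_volume_uL
print(working_dilution)   # dilution factor of the reagent in the 5 uL medium
```

Under this assumption the reagent would be prepared at 1:400 in the 5 μL of medium, which the fivefold increase in volume then brings to 1:2000.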
3.4 Synergy Scoring: Installation of the SynergyFinder R-package
1. Download and install R (https://www.R-project.org).
2. Download and install Bioconductor (https://www.bioconductor.org/).
3. Install the SynergyFinder package by typing in the R console as
below:
> source("https://www.bioconductor.org/biocLite.R")
> biocLite("synergyfinder")
4. Load the package:
> library(synergyfinder)
3.5 Synergy Scoring: Input Data
1. A single csv file that describes a drug combination dataset is
provided as input. The csv file is in a list format and must contain
the following columns:
• BlockID: the identifier for a drug combination. If multiple
drug combinations are present, e.g., in the standard 384-well
plate where six drug combinations are fitted, then the identifiers
for each of them must be unique.
• Row and Col: the row and column indexes for each well in
the plate.
• DrugCol: the name of the drug on the columns in a dose-response
matrix.
• DrugRow: the name of the drug on the rows in a dose-response
matrix.
• ConcCol and ConcRow: the concentrations of the column
drugs and row drugs in combination.
• ConcUnit: the unit of concentrations. It is typically nM or
μM.
• Response: the effect of drug combinations at the concentrations
specified by ConcCol and ConcRow. The effect must be
normalized to %inhibition of cell viability or proliferation
based on the positive and negative controls. For a well-controlled
experiment, the range of the response values is
expected from 0 to 100. However, missing values or extreme
values are allowed. For input data where the drug effect is
represented as %viability, the program will internally convert
it to %inhibition value by 100 − %viability.
2. We provide example input data in the R package, which is
extracted from a recent drug combination screen for treatment
of diffuse large B-cell lymphoma (DLBCL) [7]. The example
input data contains two representative drug combinations
(ibrutinib and ispinesib, and ibrutinib and canertinib) for which
the %viability of a cell line TMD8 was assayed using a 6 by 6 dose
matrix design. The example data in the required list format can
be loaded and reshaped to a dose-response matrix format for
further analysis by typing:
> data("mathews_screening_data")
> dose.response.mat <- ReshapeData(mathews_screening_data, data.type = "viability")
3. The “data.type” parameter specifies the type of drug response,
which can be either “viability” or “inhibition.” We will use these
example data to illustrate the main functions of SynergyFinder
below. More documentation of the input and output parameters
for each function can be accessed by typing:
> help("ReshapeData")
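To make the list format concrete, the following base-R sketch builds a toy table with the required columns, converts %viability to %inhibition, and reshapes it into a dose-response matrix. The drug names and response values are made up for illustration, and the reshaping uses `xtabs` from base R as a stand-in, so it is not a description of what `ReshapeData` does internally.

```r
# Build a toy input table with the required columns (values are made up).
conc_row <- c(0, 10, 100)
conc_col <- c(0, 1, 10)
grid <- expand.grid(Row = 1:3, Col = 1:3)
combo <- data.frame(
  BlockID  = 1,
  Row      = grid$Row,
  Col      = grid$Col,
  DrugRow  = "drugA",
  DrugCol  = "drugB",
  ConcRow  = conc_row[grid$Row],
  ConcCol  = conc_col[grid$Col],
  ConcUnit = "nM",
  Response = c(100, 90, 60, 85, 70, 40, 50, 35, 10)   # %viability
)

# Convert %viability to %inhibition, as described above
combo$Inhibition <- 100 - combo$Response

# Reshape the list format into a dose-response matrix (rows = ConcRow)
mat <- xtabs(Inhibition ~ ConcRow + ConcCol, data = combo)
print(mat)
```

In this layout, the first row and first column of the matrix hold the single-drug responses, which is the part of the matrix used for the monotherapy dose-response curves.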
3.6 Synergy Scoring: Input Data Visualization
1. The input data can be visualized using the function PlotDoseResponse
by typing:
> PlotDoseResponse(dose.response.mat)
2. The function fits a four-parameter log-logistic model to generate
the dose-response curves for the single drugs based on the first
row and first column of the dose-response matrix. The drug
combination responses are also plotted as heatmaps. From
those, one can assess the therapeutic significance of the combination,
e.g., by identifying the concentrations at which the drug
combination can lead to a maximal effect on the inhibition of
cancer cell survival/proliferation (see Fig. 4). The PlotDoseResponse
function also provides a high-resolution pdf file by adding
the “save.file” parameter:
> PlotDoseResponse(dose.response.mat, save.file = TRUE)
3. The pdf file will be saved under the current working directory with
a file name of the form “drug1.drug2.dose.response.blockID.pdf.”
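The four-parameter log-logistic (4PL) model mentioned in step 2 can be written as a short R function. This is a sketch of one common parameterization, given here to show the shape of the curve; the parameter names and values are illustrative assumptions and do not come from the package.

```r
# Four-parameter log-logistic (4PL) dose-response function.
# Parameter names and values are illustrative, not fitted estimates.
four_pl <- function(dose, e_min, e_max, ec50, slope) {
  e_min + (e_max - e_min) / (1 + (ec50 / dose)^slope)
}

doses <- c(0.2, 0.8, 3.1, 12.5, 50)          # ibrutinib doses (nM) from Fig. 4
response <- four_pl(doses, e_min = 0, e_max = 80, ec50 = 3, slope = 1.5)
print(round(response, 1))

# At dose = EC50 the response is halfway between e_min and e_max
print(four_pl(3, e_min = 0, e_max = 80, ec50 = 3, slope = 1.5))
```

The four parameters are the minimum effect, the maximum effect, the dose giving the half-maximal effect, and the slope of the curve at that dose. In practice they are estimated from the single-drug responses rather than chosen by hand.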
[Figure 4: single-drug dose-response curves (inhibition (%) versus concentration (nM)) and
dose-response matrices (%inhibition) for the two example combinations.]
(a) Block 1. Rows: ispinesib (nM); columns: ibrutinib (nM).
ispinesib        0      0.2      0.8      3.1     12.5       50
2500         54.21    61.96    75.5     84.91    93.17    92.2
625          60.63    61.95    76.74    85.9     93.45    94.07
156.2        60.76    66.48    74.07    87.57    92.39    94.28
39.1         60.25    63.13    70.9     86.49    92.94    93.72
9.8          59.22    63.63    71.96    88.24    85       91.05
0           −22.96    −3.76   −18.14    41.17    53.33    71.3
(b) Block 2. Rows: canertinib (nM); columns: ibrutinib (nM).
canertinib       0      0.2      0.8      3.1     12.5       50
2500         53.02    58.96    64.31    67.29    66.19    86.71
625         −51.48   −40.63   −48.98   −58.41   −21.65    36.12
156.2       −81.18   −42.65   −35.38   −48.34    −0.4     59.2
39.1        −87.24   −78.45   −56.94   −18.56    34.89    72.84
9.8         −53.88   −36.33   −35.28     2.93    55.31    73.44
0           −15.04     4.76    10.43    59.69    68.31    76.97
Fig. 4 Plots for single-drug dose-response curves and drug combination dose-response matrices. (a) The
ibrutinib and ispinesib combination. (b) The ibrutinib and canertinib combination. Left panel: single drug dose-
response curves fitted with the commonly-used 4-parameter log-logistic (4PL) function. Right panel: the raw
dose-response matrix data is visualized as a heatmap
3.7 Synergy Scoring: Drug Synergy Scoring (See Notes 9 and 10)
1. The current SynergyFinder package provides the synergy scores
of four major reference models, including HSA, Loewe, Bliss,
and ZIP. In a drug combination experiment where drug 1 at
dose x1 is combined with drug 2 at dose x2, the effect of such a
combination is yc as compared to the monotherapy effect y1(x1)
and y2(x2). To be able to quantify the degree of drug interactions,
one needs to determine the deviation of yc from the
expected effect ye of non-interaction, which is calculated in
different ways with the individual reference models.
• HSA: ye is the highest monotherapy effect, i.e.,
ye = max(y1, y2).
• Loewe: ye is the effect that would be achieved if a drug was
combined with itself, i.e., ye = y1(x1 + x2) = y2(x1 + x2).
• Bliss: ye is the effect that would be achieved if the two drugs are
acting independently on the phenotype, i.e., ye = y1 + y2 − y1y2
(with effects expressed as fractions between 0 and 1).
• ZIP: ye is the effect that would be achieved if the two drugs
do not potentiate each other, i.e., both the assumptions of
the Loewe model and the Bliss model are met.
2. Once ye has been determined, the synergy score can be calculated
as the difference between the observed effect yc and the expected
effect ye. Depending on whether yc > ye or yc < ye, the drug
combination can be classified as synergistic or antagonistic,
respectively. Furthermore, as the input data has been normalized
as %inhibition, the synergy score can be directly interpreted as
the proportion of cellular responses that can be attributed to the
drug interactions.
3. For a given dose-response matrix, one needs to first choose
which reference model to use and then apply the CalculateSy-
nergy function to calculate the corresponding synergy score at
each dose combination. For example, the ZIP-based synergy
score for the example data can be obtained by typing:
> synergy.score <- CalculateSynergy(data ¼ dose.response.

mat, method ¼ “ZIP”, correction ¼ TRUE)
4. To assess the synergy scores with the other reference models,
one needs to change the “method” parameter to “HSA”,
“Loewe”, or “Bliss”. The “correction” parameter specifies whether
a baseline correction is applied to the raw dose-response data.
The baseline correction uses the average of the minimum
responses of the two single drugs as a baseline response to
correct the negative response values. The output “synergy.score”
contains a score matrix of the same size as the dose-response
matrix, which facilitates a dose-level evaluation of drug synergy
as well as a direct comparison of the synergy scores between two
reference models.
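The HSA and Bliss expectations from step 1 and the score from step 2 can be illustrated with a short numerical sketch. It is written in Python rather than R and is not the SynergyFinder implementation: the function names and the toy dose-response matrix are hypothetical, responses are assumed to be in % inhibition, and the first row and first column of the matrix are assumed to hold the single-drug responses. The Loewe and ZIP models need fitted dose-response curves and are left out.

```python
import numpy as np


def expected_effect(y1, y2, method):
    """Expected % inhibition of a non-interacting drug pair (HSA or Bliss only)."""
    if method == "HSA":
        # Highest single agent: the larger of the two monotherapy effects.
        return np.maximum(y1, y2)
    if method == "Bliss":
        # Bliss independence, ye = y1 + y2 - y1*y2 on the fractional scale;
        # dividing the product by 100 keeps everything in % inhibition.
        return y1 + y2 - y1 * y2 / 100.0
    raise ValueError("this sketch only covers 'HSA' and 'Bliss'")


def synergy_scores(matrix, method):
    """Synergy score yc - ye at every combination dose of a toy matrix."""
    matrix = np.asarray(matrix, dtype=float)
    y1 = matrix[1:, [0]]   # drug 1 alone (assumed: first column)
    y2 = matrix[[0], 1:]   # drug 2 alone (assumed: first row)
    ye = expected_effect(y1, y2, method)  # broadcasts over the dose grid
    yc = matrix[1:, 1:]    # observed combination effects
    return yc - ye


# Hypothetical 3 x 3 matrix of % inhibition values (not data from the chapter).
toy = [[0, 10, 40],
       [20, 50, 70],
       [50, 60, 65]]

bliss = synergy_scores(toy, "Bliss")
hsa = synergy_scores(toy, "HSA")
print(bliss)
print(hsa)
```

Positive entries correspond to yc > ye (synergy) and negative entries to yc < ye (antagonism) at that dose pair.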
3.8 Synergy Scoring: The Drug Interaction Landscape
1. The synergy scores are calculated across all the tested
concentration combinations, which can be visualized as either a
two-dimensional or a three-dimensional interaction surface
over the dose matrix. The landscape of such a drug interaction
scoring is very informative when identifying the specific dose
regions where a synergistic or antagonistic drug interaction
occurs. The height of the 3D drug interaction landscape is
normalized as the % inhibition effect to facilitate a direct
comparison of the degrees of interaction among multiple drug
combinations. In addition, a summarized synergy score is provided
by averaging over the whole dose-response matrix. To visualize
the drug interaction landscape, one can utilize the PlotSynergy
function as below (see Fig. 5):
> PlotSynergy(synergy.score, type = "all", save.file = TRUE)
2. The “type” parameter specifies the visualization type of the
interaction surface as 2D, 3D, or both.
Fig. 5 The drug interaction landscapes based on the ZIP model, shown as 2D and 3D surfaces over the
dose matrix (color scale and height: inhibition, %). (a) The ibrutinib and ispinesib combination (ZIP
synergy score: 18.042). (b) The ibrutinib and canertinib combination (ZIP synergy score: −16.339)
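The summarized synergy score mentioned in step 1 is an average over the score matrix. A minimal Python sketch of that averaging is given here; it is not SynergyFinder code. The text does not say whether the single-drug row and column are included in the average, so the hypothetical `skip_monotherapy` flag exposes both choices, and the score matrix is made up for illustration.

```python
import numpy as np


def summary_synergy_score(scores, skip_monotherapy=True):
    """Average a synergy score matrix into one summary value.

    skip_monotherapy is an assumption of this sketch: when True, the first
    row and column (taken to be the single-drug doses) are left out.
    """
    scores = np.asarray(scores, dtype=float)
    if skip_monotherapy:
        scores = scores[1:, 1:]
    return float(scores.mean())


def classify(score):
    """Label a score by its sign: yc > ye is synergy, yc < ye is antagonism."""
    if score > 0:
        return "synergistic"
    if score < 0:
        return "antagonistic"
    return "non-interactive"


# Hypothetical score matrix; zeros mark the single-drug row and column.
toy_scores = [[0, 0, 0],
              [0, 22, 18],
              [0, 5, -5]]

combo_only = summary_synergy_score(toy_scores)
whole_matrix = summary_synergy_score(toy_scores, skip_monotherapy=False)
print(combo_only, classify(combo_only))
print(whole_matrix, classify(whole_matrix))
```

The two averages differ because the single-drug entries dilute the mean, which is why the convention should be fixed before comparing summary scores across drug combinations.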
4 Notes
1. Example cell lines include the four that are used for the
quality check of the compound library: DU4475 (breast cancer),
HDQ-P1 (breast cancer), IGROV-1 (ovarian cancer), and
MOLM-13 (acute monocytic leukemia).
2. Specific software tools are needed in the experimental design
stage and in the data analysis stage. For the 384-well plate
design, once the drugs and the concentration ranges are
selected, we use the in-house cherry-picking program, FIMMCherry,
to automatically generate the echo files needed for the
Labcyte Access system.
3. The FIMM oncology collection contains both FDA/EMA-
approved drugs and investigational compounds (see Fig. 1).
The collection is constantly evolving and the current FO4B
version contains 525 compounds with concentrations ranging
typically between 1 and 10,000 nM. For some compounds, the
concentration range is adjusted upward (e.g., platinum drugs,
100,000 nM) or downward (e.g., rapalogs, 100 nM) to better
match their relevant concentrations of bioactivity. The full list
of the FIMM oncology compounds can be found in
Supplementary Table 1.
4. Once the drug combination dose-response matrix data are
ready, we use the SynergyFinder R package to score and
visualize the drug interactions. SynergyFinder is also available
as a web application, which does not require installing the R
environment.
5. The SynergyFinder package will be continuously updated to
include more rigorous analyses such as statistical significance,
effect size, and noise detection.
6. Availability: The source code for the FIMMCherry program is
available at github (https://github.com/hly89/FIMM-
Cherry). The SynergyFinder R package for drug combination
data analysis is available at CRAN and Bioconductor.
7. FIMMCherry is a desktop GUI application developed
using Python (https://www.python.org/) and the Qt application
development framework (https://www.qt.io/). The integration
of Python and Qt allows FIMMCherry to run on all the
major computer platforms, including Windows, Linux, and
Mac OS X.
8. We have not seen problems in cell proliferation rate or other
major effects when using the reagent. The reagent is stable for at
least 72 h in the cell culture, and cells dying at the beginning
of the 72 h incubation are still stained after 72 h.
Table 1
The FIMM oncology compound collection
High phase High
DRUG_ Mechanism Class approval Trade Supplier Conc.
NAME targets explained status Alias names Supplier Ref Solvent (nM)
SN-38 Active metabolite of A. Conv. (approved) BR-36613, 7 ChemieTek CT-SN38 DMSO 10000
irinotecan. Chemo -Ethyl-10-
Topoisomerase I hydroxy
inhibitor camptothecine
Idarubicin Topoisomerase II A. Conv. Approved Zavedos, Sigma-Aldrich I1656 DMSO 1000
inhibitor Chemo Idamycin
Auranofin Antirheumatic agent A. Conv. Approved Sigma-Aldrich A6733 DMSO 2500
Chemo
Plicamycin RNA synthesis inhibitor A. Conv. Approved Mithramycin A Santa Cruz sc-200909-5 DMSO 10000
Chemo Biotechnology
Bortezomib Proteasome inhibitor A. Conv. Approved MS-341 Velcade, National Cancer NSC 681239- DMSO 1000
(26S subunit) Chemo Cytomib Institute L/9
Clofarabine Antimetabolite; Purine A. Conv. Approved Evoltra, National Cancer NSC 606869- DMSO 10000
analog Chemo Clolar Institute X/4
Lomustine Alkylating nitrosourea A. Conv. Approved CCNU, CeeNU National Cancer NSC 79037- DMSO 10000
compound Chemo Institute R/12
Vincristine Mitotic inhibitor. Vinca A. Conv. Approved Selleck S1241 DMSO 1000
alkaloid microtubule Chemo
depolymerizer
Vinorelbine Mitotic inhibitor. Vinca A. Conv. Approved Selleck S4269 DMSO 10000
alkaloid microtubule Chemo
depolymerizer
Altretamine Formaldehyde release, A. Conv. Approved National Cancer NSC 13875- DMSO 10000
alkylating agent Chemo Institute O/97
Vinblastine Mitotic inhibitor. Vinca A. Conv. Chemo Approved National Cancer NSC 49842- DMSO 1000
alkaloid microtubule Institute J44
depolymerizer
Chlorambucil Nitrogen mustard A. Conv. Chemo Approved National Cancer NSC 3088- DMSO 10000
alkylating agent Institute N/6
Drug Combination Screening and Data Analysis
Dacarbazine Alkylating agent A. Conv. Chemo Approved National Cancer NSC 45388- DMSO 10000
Institute R/74
Cyclophosphamide Alkylating agent A. Conv. Chemo Approved Selleck S1217 DMSO 40000
365
(continued)
Table 1
366
(continued)
High phase High

Cytarabine Antimetabolite, interferes A. Conv. Chemo Approved Ara-C National Cancer NSC 63878- DMSO 10000
with DNA synthesis Institute P/19
Liye He et al.
Fluorouracil Antimetabolite A. Conv. Chemo Approved 5-fluorouracil, 5-FU National Cancer NSC 19893- DMSO 10000
Institute G/4
Ifosfamide Nitrogen mustard A. Conv. Chemo Approved National Cancer NSC 109724- DMSO 10000
alkylating agent Institute X/4
Melphalan Nitrogen mustard A. Conv. Chemo Approved Sigma-Aldrich M2011 AQ 12500
alkylating agent
Mitoxantrone Topoisomerase II A. Conv. Chemo Approved National Cancer NSC 279836- DMSO 1000
inhibitor Institute C/2
Paclitaxel Mitotic inhibitor, taxane A. Conv. Chemo Approved Taxol National Cancer NSC 125973- DMSO 1000
microtubule stabilizer Institute L/68
Procarbazine Alkylating agent A. Conv. Chemo Approved National Cancer NSC 77213- DMSO 10000
Institute K/6
Topotecan Topoisomerase I A. Conv. Chemo Approved National Cancer NSC 609699- DMSO 10000
inhibitor. Institute Y/16
Camptothecin analog
Temozolomide Alkylating agent A. Conv. Chemo Approved National Cancer NSC 362856- DMSO 100000
Institute R/31
Mechlorethamine Nitrogen mustard A. Conv. Chemo Approved Nitrogen mustard Mustargen Sigma-Aldrich 122564 DMSO 100000
alkylating agent
Mitotane Antineoplastic agent A. Conv. Chemo Approved National Cancer NSC 38721- DMSO 10000
Institute U/3
Allopurinol Xanthine oxidase A. Conv. Chemo Approved Zyloprim National Cancer NSC 1390-R/ DMSO 10000
inhibitor Institute 3
Busulfan Alkylating antineoplastic A. Conv. Chemo Approved Sigma-Aldrich B2635 DMSO 100000
agent
Hydroxyurea Antineoplastic agent A. Conv. Chemo Approved Myelostat Sigma-Aldrich H8627 DMSO 100000
Mercaptopurine Antimetabolite A. Conv. Chemo Approved 6-mercaptopurine, National Cancer NSC 755-Z/ DMSO 10000
6-MP Institute 13
Thioguanine Antimetabolite; Purine A. Conv. Chemo Approved 6-thioguanine, National Cancer NSC 752-W/ DMSO 10000
analog 6-TG Institute 47
Carmustine Alkylating agent A. Conv. Approved BCNU National Cancer NSC 409962- DMSO 10000
Chemo Institute T/3
Thio-TEPA Alkylating agent A. Conv. Approved Sigma-Aldrich T6069 DMSO 50000
Chemo
Pipobroman Alkylating agent A. Conv. Chemo Approved National Cancer NSC 25154- DMSO 10000
Institute X/2
Raltitrexed DHFR/GARFT/ A. Conv. Chemo Approved ICI-D 1694 Tomudex Medchemexpress HY-10821 DMSO 1000
thymidylate synthase
inhibitor
Irinotecan Topoisomerase I A. Conv. Chemo Approved Camptosar LC Laboratories I-4122 DMSO 10000
inhibitor.
Camptothecin
prodrug analog
Nelarabine Nucleoside analog, DNA, A. Conv. Chemo Approved Arranon, SequoiaResearch SRP003328n DMSO 10000
RNA synth inhibitor Atriance Products
Docetaxel Mitotic inhibitor, taxane A. Conv. Chemo Approved Taxotere, LC Laboratories D-1000 DMSO 1000
microtubule stabilizer Docecad
Pentostatin Antimetabolite; Purine A. Conv. Chemo Approved Deoxycoformycin National Cancer NSC 218321- DMSO 10000
analog Institute O/48
Estramustine Alkylating agent A. Conv. Chemo Approved Sigma-Aldrich E0407 AQ 10000
Floxuridine Antimetabolite; Analog of A. Conv. Chemo Approved 5-fluorodeoxyuridine National Cancer NSC 27640- DMSO 10000
5-fluorouracil Institute Z/31
Gemcitabine Antimetabolite; A. Conv. Chemo Approved Gemsar, National Cancer NSC 613327- DMSO 1000
Nucleoside analog Gemzar Institute S/2
Teniposide Topoisomerase II A. Conv. Chemo Approved National Cancer NSC 122819- DMSO 10000
inhibitor Institute I/52
Dactinomycin RNA and DNA synthesis A. Conv. Chemo Approved Actinomycin D National Cancer NSC 3053-Y/ DMSO 1000
inhibitor Institute 14
Streptozocin Alkylating glucosamine- A. Conv. Chemo Approved National Cancer NSC 37917- DMSO 10000
nitrosourea agent Institute V/5
Cladribine Antimetabolite; Purine A. Conv. Chemo Approved Leustatin National Cancer NSC 105014- DMSO 1000
analog Institute F/2
Mitomycin C Antineoplastic A. Conv. Chemo Approved National Cancer NSC 26980- DMSO 10000
anatibiotic; DNA Institute J/65
crosslinker
367
(continued)
Table 1
368
(continued)
High phase High

Carboplatin Platinum-based A. Conv. Chemo Approved Selleck S1215-2 AQ 100000

antineoplastic agent
Liye He et al.
Cisplatin Platinum-based A. Conv. Chemo Approved Selleck S1166-2 AQ 100000

Oxaliplatin Platinum-based A. Conv. Chemo Approved Selleck S1224-2 AQ 100000
Uracil mustard Alkylating agent A. Conv. Chemo Approved Uracil mustard National Cancer NSC 34462- DMSO 10000
Institute Q/2
Daunorubicin Topoisomerase II A. Conv. Chemo Approved National Cancer NSC 82151- DMSO 1000
inhibitor Institute A/44
Etoposide Topoisomerase II A. Conv. Chemo Approved National Cancer NSC 141540- DMSO 10000
inhibitor Institute H/184
Doxorubicin Topoisomerase II A. Conv. Chemo Approved Adriamycin Sigma-Aldrich D1515 DMSO 1000
inhibitor
Valrubicin Topoisomerase II A. Conv. Chemo Approved Valstar National Cancer NSC 246131- DMSO 5000
inhibitor Institute R/4
Ixabepilone Mitotic inhibitor. A. Conv. Chemo Approved azaepothilone B Ixempra National Cancer NSC 747973- DMSO 1000
Epothilone Institute W/3
microtubule stabilizer.
Fludarabine Antimetabolite; Purine A. Conv. Chemo Approved Fludara National Cancer NSC 312887- DMSO 10000
analog Institute C/7
Bleomycin Glycopeptide antibiotic; A. Conv. Chemo Approved Selleck S1214 DMSO 10000
causes DNA breaks
Carfilzomib Proteasome inhibitor A. Conv. Chemo Approved Kyprolis ChemieTek CT-CARF DMSO 1000
(20S subunit)
Bendamustine Nitrogen mustard A. Conv. Chemo Approved National Cancer NSC 138783 DMSO 10000
alkylating agent Institute
Capecitabine 5-FU prodrug A. Conv. Chemo Approved LC Laboratories C-2799 DMSO 10000
Chloroquine Antimalaria agent; A. Conv. Chemo Approved Sigma-Aldrich C6628 AQ 100000
chemo/radio
sensitizer
Omacetaxine Protein synthesis inhib A. Conv. Chemo Approved Homoharringtonine Synribo Santa Cruz sc-202652 DMSO 10000
(80 S ribosome) Biotechnology
Amsacrine DNA intercalation, Topo A. Conv. Chemo Approved Acridinyl anisidide Santa Cruz sc-214540 DMSO 10000
II inhibitor Biotechnology
Cabazitaxel Taxane microtubule A. Conv. Chemo Approved XRP6258 Jevtana Medchemexpress HY-15459 DMSO 1000
stabilizer, antimitotic
Oprozomib proteasome (20 S) A. Conv. Chemo Investigational PR-047 ChemieTek CT-OPRO DMSO 2500
inhibitor (Ph 1)
ABT-751 Mitotic inhibitor. A. Conv. Chemo Investigational Selleck S1165 DMSO 10000
Colchicine site binding (Ph 2)
microtubule
depolymerizer.
Indibulin Mitoric inhibitor. A. Conv. Chemo Investigational Tocris 3728 DMSO 10000
Microtubule (Ph 2) Biosciences
depolymerizer
Aldoxorubicin Topo II, albumin A. Conv. Chemo Investigational Medchemexpress HY-16261 DMSO 1000
(Ph 2)
Amonafide Topoisomerase II A. Conv. Chemo Investigational Xanafide, Quinamed Selleck S1367 DMSO 10000
inhibitor / DNA (Ph 3)
intercalator
Patupilone Mitotic inhibitor, A. Conv. Chemo Investigational Epothilone B, EpoB LC Laboratories E-5500-2 DMSO 1000
epothilone (Ph 3)
microtubule stabilizer
Pixantrone topoisomerase II A. Conv. Chemo Investigational Medchemexpress HY-13727A AQ 10000
inhibitor (Ph 3)
Camptothecin Topoisomerase I inhibitor A. Conv. Chemo Probe Tocris 1100 DMSO 5000
Biosciences
8-chloro-adenosine Nucleoside analog, RNA A. Conv. Chemo Probe 8-chloro-adenosine Northwestern NWU 8CL DMSO 50000
synthesis inhibitor University
8-amino-adenosine Nucleoside analog, RNA A. Conv. Chemo Probe 8-amino-adenosine Northwestern NWU 8NH2 DMSO 50000
synthesis inhibitor University
Hydroxyfasudil ROCK, PKA, PKG, PRK B. Kinase inhibitor Active Santa Cruz sc-202176 DMSO 37500
inhibitor metabolite Biotechnology
of approved
drug
Gefitinib EGFR inhibitor B. Kinase inhibitor Approved Iressa LC Laboratories G-4408 DMSO 10000
Imatinib Abl, Kit, PDGFRB B. Kinase inhibitor Approved Gleevec, LC Laboratories C-5508 DMSO 10000
inhibitor Glivec
369
(continued)
Table 1
370
(continued)
High phase High

Erlotinib EGFR inhibitor B. Kinase inhibitor Approved OSI-774 Tarceva National Cancer NSC 718781- DMSO 10000
Institute R/4
Liye He et al.
Lapatinib HER2, EGFR inhibitor B. Kinase inhibitor Approved GW2016 Tykerb, LC Laboratories L-4804 DMSO 1000
Tyverb
Palbociclib CDK inhibitor (Cdk4/6) B. Kinase inhibitor Approved Ibrance Selleck S1116 DMSO 10000
Afatinib EGFR inhibitor B. Kinase inhibitor Approved Gilotrif, Selleck S1011 DMSO 1000
Giotrif
Crizotinib ALK, c-Met inhibitor B. Kinase inhibitor Approved Xalkori Selleck S1068 DMSO 1000
Ponatinib Broad TK inhibitor B. Kinase inhibitor Approved Iclusig Selleck S1490 DMSO 1000
Trametinib MEK1/2 inhibitor B. Kinase inhibitor Approved JTP-74057 Mekinist ChemieTek CT-GSK112 DMSO 250
Ruxolitinib JAK1&2 inhibitor B. Kinase inhibitor Approved Jakafi, Jakavi ChemieTek CT-INCB DMSO 10000
Nilotinib Abl inhibitor B. Kinase inhibitor Approved Tasigna LC Laboratories N-8207 DMSO 10000
Vemurafenib B-Raf(V600E) inhibitor B. Kinase inhibitor Approved RG7204, RO5185426 Zelboraf ChemieTek CT-P4032 DMSO 10000
Vandetanib VEGFR,EGFR, RET B. Kinase inhibitor Approved Caprelsa LC Laboratories V-9402 DMSO 1000
inhibitor
Dasatinib Abl, Src, Kit, EphR... B. Kinase inhibitor Approved Sprycel LC Laboratories D-3307 DMSO 1000
Inhibitor
Tofacitinib JAK3, JAK2(V617F) B. Kinase inhibitor Approved tasocitinib Xeljanz, LC Laboratories T-1377 DMSO 5000
inhibitor Jakvinus
Axitinib VEGFR, PDGFR, KIT B. Kinase inhibitor Approved Inlyta LC Laboratories A-1107 DMSO 10000
inhibitor
Bosutinib Abl, Src inhibitor B. Kinase inhibitor Approved Bosulif LC Laboratories B-1788 DMSO 10000
Pazopanib VEGFR inhibitor B. Kinase inhibitor Approved Votrient LC Laboratories P-6706 DMSO 10000
Sorafenib B-Raf, FGFR-1, VEGFR- B. Kinase inhibitor Approved Nevaxar LC Laboratories S-8502 DMSO 1000
2 & -3, PDGFR-beta,
KIT, and FLT3 inhib
Sunitinib Broad TK inhibitor B. Kinase inhibitor Approved Sutent LC Laboratories S-8803 DMSO 1000
Regorafenib B-Raf, c-Kit, VEGFR2 B. Kinase inhibitor Approved Stivarga Selleck S1178 DMSO 10000
inhibitor
Cabozantinib VEGFR2, Met, FLT3, B. Kinase inhibitor Approved XL184 Cometriq ChemieTek CT-XL184 DMSO 1000
Tie2, Kit and Ret
inhibitor
Ibrutinib Btk inhibitor B. Kinase inhibitor Approved CRA-032765 Imbruvica Selleck S2680 DMSO 1000
Dabrafenib B-Raf(V600E) inhibitor B. Kinase inhibitor Approved Tafinlar ChemieTek CT-DABR DMSO 2500
Ceritinib ALK inhibitor B. Kinase inhibitor Approved LDK378 Zykadia Selleck S7083 DMSO 2500
Fasudil ROCK, PKA, PKG, PRK B. Kinase inhibitor Approved HA-1077 LC Laboratories H-2330 DMSO 50000
inhibitor, prodrug (Japan)
Alectinib ALK (incl gatekeeper B. Kinase inhibitor Approved Alecensa ChemieTek CT-CH542 DMSO 1000
mut) inhib (Japan)
Idelalisib PI3K inhibitor, B. Kinase inhibitor Approved (US) CAL-101 Zydelig ChemieTek CT-CAL101 DMSO 10000
p110δ-selective
Nintedanib VEGFR, PDGFR, FGFR B. Kinase inhibitor Approved (US) Indetanib Vargatef, Selleck S1010 DMSO 10000
inhibitor Ofev
Lenvatinib VEGFR inhibitor B. Kinase inhibitor Approved (US) Lenvima Selleck S1164 DMSO 2500
CUDC-101 HDAC & EGFR, Her2 B. Kinase inhibitor Investigational Selleck S1194 DMSO 10000
inhibitor (Ph 1)
PF-00477736 Chk1 inhibitor B. Kinase inhibitor Investigational Axon Medchem Axon 1379-2 DMSO 10000
(Ph 1)
AZD7762 Chk1 inhibitor B. Kinase inhibitor Investigational Axon Medchem Axon 1399 DMSO 1000
(Ph 1)
AZD8055 mTOR inhibitor B. Kinase inhibitor Investigational ChemieTek CT-A8055-3 DMSO 10000
(Ph 1)
Doramapimod p38MAPK inhibitor B. Kinase inhibitor Investigational Axon Medchem Axon 1358 DMSO 10000
(Ph 1)
Bryostatin 1 PKC activator B. Kinase inhibitor Investigational Santa Cruz sc-201407-4 DMSO 100
(Ph 1) Biotechnology
EMD1214063 c-Met inhibitor B. Kinase inhibitor Investigational ChemieTek CT-EMD063 DMSO 1000
(Ph 1)
AZD1480 JAK1/2, FGFR inhibitor B. Kinase inhibitor Investigational ChemieTek CT-A1480 DMSO 1000
(Ph 1)
Tamatinib Syk inhibitor B. Kinase inhibitor Investigational Selleck S2194-2 DMSO 10000
(Ph 1)
371
(continued)
Table 1
372
(continued)
High phase High

TAK-733 MEK1/2 inhibitor B. Kinase inhibitor Investigational Selleck S2617 DMSO 1000
(Ph 1)
Liye He et al.
Omipalisib PI3K/mTOR inhibitor B. Kinase inhibitor Investigational Selleck S2658 DMSO 1000
(Ph 1)
TAK-901 Aurora B inhibitor B. Kinase inhibitor Investigational Selleck S2718 DMSO 1000
(Ph 1)
NVP-BGJ398 FGFR inhibitor B. Kinase inhibitor Investigational BGJ398 ChemieTek CT-BGJ398 DMSO 1000
(Ph 1)
INK128 mTOR inhibitor B. Kinase inhibitor Investigational INK128 ChemieTek CT-INK128 DMSO 1000
(Ph 1)
ZSTK474 PI3K gamma selective B. Kinase inhibitor Investigational LC Laboratories Z-1066 DMSO 10000
inhibitor (Ph 1)
AZD2014 mTOR inhibitor, B. Kinase inhibitor Investigational Selleck S2783 DMSO 10000
ATP-competitive (Ph 1)
GSK2636771 PI3K beta selective B. Kinase inhibitor Investigational ChemieTek CT-GSK263 DMSO 10000
inhibitor (Ph 1)
Rebastinib Allosteric ABL, FLT3, B. Kinase inhibitor Investigational ChemieTek CT-DCC20 DMSO 1000
TIE2, TRKA inhibitor (Ph 1)
BMS-911543 JAK2 inhibitor B. Kinase inhibitor Investigational ChemieTek CT-BMS911 DMSO 10000
(Ph 1)
LY-294002 PI3K inhibitor B. Kinase inhibitor Investigational LC Laboratories L-7962 DMSO 100000
(Ph 1)
ASP3026 ALK inhibitor B. Kinase inhibitor Investigational ChemieTek CT-ASP302 DMSO 10000
(Ph 1)
PF-03758309 PAK inhibitor B. Kinase inhibitor Investigational ChemieTek CT-PF0375 DMSO 10000
(Ph 1)
AZD-8330 MEK1/2 inhibitor B. Kinase inhibitor Investigational ARRY-424704 ChemieTek CT-A8330 DMSO 10000
(Ph 1)
BMS-599626 Pan-HER inhibitor B. Kinase inhibitor Investigational AC-480 ChemieTek CT-BMS59 DMSO 10000
(Ph 1)
LY-2874455 FGFR inhibitor B. Kinase inhibitor Investigational Axon Medchem Axon 1981 DMSO 1000
(Ph 1)
SGI-1776 PIM kinase inhibitor B. Kinase inhibitor Investigational Selleck S2198 DMSO 10000
(Ph 1)
AT7519 CDK1, 2, 4, 6 and B. Kinase inhibitor Investigational Selleck S1524 DMSO 10000
9 inhibitor (Ph 1)
TAK-960 PLK1 inhibitor B. Kinase inhibitor Investigational Santa Cruz sc-364631 DMSO 2500
(Ph 1) Biotechnology
Lucitanib FGFR1, VEGFR B. Kinase inhibitor Investigational CO-3810, S 80881 Axon Medchem Axon 1942 DMSO 10000
inhibitor (Ph 1)
AMG-208 MET inhibitor B. Kinase inhibitor Investigational Selleck S1316 DMSO 2500
(Ph 1)
AMG-900 pan-Aurora inhibitor B. Kinase inhibitor Investigational Selleck S2719 DMSO 1000
(Ph 1)
ARRY-380 HER2 inhibitor B. Kinase inhibitor Investigational Selleck S2752 DMSO 2500
(Ph 1)
GSK-1070916 AURb, AURc inhibitor B. Kinase inhibitor Investigational Selleck S2740 DMSO 1000
(Ph 1)
GSK-461364 PLK1 inhibitor B. Kinase inhibitor Investigational Selleck S2193 DMSO 10000
(Ph 1)
NVP-INC280 MET inhibitor B. Kinase inhibitor Investigational INC280, INCB-28060 Selleck S2788 DMSO 1000
(Ph 1)
OSI-930 KIT, VEGFR inhibitor B. Kinase inhibitor Investigational Selleck S1220 DMSO 2500
(Ph 1)
Palomid-529 AKT, MTOR, PI3K B. Kinase inhibitor Investigational P529 Selleck S2238 DMSO 10000
inhibitor (Ph 1)
PF-00562271 FAK inhibitor B. Kinase inhibitor Investigational Selleck S2672 DMSO 10000
(Ph 1)
PF-03814735 AURa, AURb inhibitor B. Kinase inhibitor Investigational Selleck S2725 DMSO 10000
(Ph 1)
Gedatolisib PI3K/mTOR inhibitor B. Kinase inhibitor Investigational PKI-587 Selleck S2628 DMSO 1000
(Ph 1)
SNS-314 AURa, AURb inhibitor B. Kinase inhibitor Investigational Selleck S1154 DMSO 1000
(Ph 1)
TAK-285 HER2 inhibitor B. Kinase inhibitor Investigational Selleck S2784 DMSO 2500
(Ph 1)
373
(continued)
Table 1
374
(continued)
High phase High

MLN-8054 AURa AURb FLT3 KIT B. Kinase inhibitor Investigational Selleck S1100 DMSO 10000
(PDGFR) (Ph 1)
Liye He et al.
KW-2449 AURa AURb FLT3 B. Kinase inhibitor Investigational Selleck S2158 DMSO 2500
inhibitor (Ph 1)
KRN-633 VEGFR inhibitor B. Kinase inhibitor Investigational Selleck S1557 DMSO 2500
(Ph 1)
PHA-793887 CDK inhibitor B. Kinase inhibitor Investigational Selleck S1487 DMSO 10000
(Ph 1)
AZD-6482 PI3Kbeta-selective B. Kinase inhibitor Investigational Selleck S1462 DMSO 2500
inhibitor (Ph 1)
GSK-1059615 PI3K/mTOR inhibitor B. Kinase inhibitor Investigational GSK-615 Selleck S1360 DMSO 10000
(Ph 1)
CYC-116 Aurora and VEGFR2 B. Kinase inhibitor Investigational Selleck S1171 DMSO 10000
inhibitor (Ph 1)
CP-724714 EGFR ERBB2 inhibitor B. Kinase inhibitor Investigational Selleck S1167 DMSO 10000
(Ph 1)
SGX-523 MET inhibitor B. Kinase inhibitor Investigational Selleck S1112 DMSO 5000
(Ph 1)
JNJ-38877605 MET inhibitor B. Kinase inhibitor Investigational Selleck S1114 DMSO 10000
(Ph 1)
GSK-690693 AKT, PKA, PKC inhibitor B. Kinase inhibitor Investigational Selleck S1113 DMSO 10000
(Ph 1)
OSU-03012 PDPK1 inhibitor B. Kinase inhibitor Investigational AR-12 Selleck S1106 DMSO 25000
(Ph 1)
NVP-AEW541 IGF1R inhibitor B. Kinase inhibitor Investigational AEW541 Selleck S1034 DMSO 10000
(Ph 1)
PF-04217903 MET inhibitor B. Kinase inhibitor Investigational Selleck S1094 DMSO 2500
(Ph 1)
AZD-1080 GSK3 inhibitor B. Kinase inhibitor Investigational AZ-11548415 Selleck S7145 DMSO 10000
(Ph 1)
RG-7603 PI3K inhibitor, pan-class B. Kinase inhibitor Investigational GDC-0349 Selleck S8040 DMSO 2500
I (Ph 1)
MK-8776 CHEK1 inhibitor B. Kinase inhibitor Investigational SCH-900776 Selleck S2735 DMSO 2500
(Ph 1)
CH-5132799 PI3K inhibitor, pan-class B. Kinase inhibitor Investigational PA-799 Selleck S2699 DMSO 10000
I (Ph 1)
AZD-5438 CDK1,2,9 inhibitor B. Kinase inhibitor Investigational Selleck S2621 DMSO 10000
(Ph 1)
Silmitasertib CSNK2A1 inhibitor B. Kinase inhibitor Investigational Selleck S2248 DMSO 10000
(Ph 1)
Mubritinib ERBB2 inhibitor B. Kinase inhibitor Investigational Selleck S2216 DMSO 1000
(Ph 1)
AZD-8186 PI3Kbeta inhibitor B. Kinase inhibitor Investigational Active Biochem A-1610 DMSO 1000
(Ph 1)
XL019 JAK2 inhibitor B. Kinase inhibitor Investigational Selleck S7036 DMSO 10000
(Ph 1)
Bentamapimod JNK inhibitor B. Kinase inhibitor Investigational Medchemexpress HY-14761 DMSO 10000
(Ph 1)
AZD1208 PIM1, 2, 3 kinase B. Kinase inhibitor Investigational Medchemexpress HY-15604 DMSO 10000
inhibitor (Ph 1)
BGB324 Axl inhibitor B. Kinase inhibitor Investigational R 428 Axon Medchem Axon 1946 DMSO 10000
(Ph 1)
CEP-37440 ALK inhibitor B. Kinase inhibitor Investigational ChemieTek CT-CEP374 DMSO 5000
(Ph 1)
AT13148 p70S6K, PKA, ROCK B. Kinase inhibitor Investigational Medchemexpress HY-16071 DMSO 10000
(AKT) inhibitor (Ph 1)
Cerdulatinib JAK, SYK inhibitor B. Kinase inhibitor Investigational Medchemexpress HY-15999 DMSO 10000
(Ph 1)
TEW-7197 TGF-β receptor ALK4/ B. Kinase inhibitor Investigational EW-7197 Selleck S7530 DMSO 2500
ALK5 inhibitor (Ph 1)
GDC-0994 ERK inhibitor B. Kinase inhibitor Investigational Medchemexpress HY-15947-2 DMSO 10000
(Ph 1)
Merestinib Met inhibitor B. Kinase inhibitor Investigational Medchemexpress HY-15514A DMSO 1000
(Ph 1)
VS-4718 FAK inhibitor B. Kinase inhibitor Investigational PND-1186 Chemietek CT-VS4718 DMSO 10000
(Ph 1)
375
(continued)
Table 1
376
(continued)
High phase High

LY3009120 pan-RAF inhibitor B. Kinase inhibitor Investigational DP-4978 Medchemexpress HY-12558 DMSO 10000
(Ph 1)
Liye He et al.
BI 2536 PLK1 inhibitor B. Kinase inhibitor Investigational Selleck S1109 DMSO 1000
(Ph 2)
AT9283 Aurora A & B, Jak2, Flt, B. Kinase inhibitor Investigational Selleck S1134 DMSO 1000
Abl inhibitor (Ph 2)
Danusertib Aurora, Ret, TrkA, B. Kinase inhibitor Investigational Selleck S1107 DMSO 10000
FGFR-1 inhibitor (Ph 2)
Foretinib MET, VEGFR2 inhibitor B. Kinase inhibitor Investigational XL880, EXEL-2880 Selleck S1111 DMSO 1000
(Ph 2)
SNS-032 CDK inhibitor B. Kinase inhibitor Investigational BMS-387032 Selleck S1145 DMSO 10000
(Ph 2)
Alvocidib CDK inhibitor B. Kinase inhibitor Investigational Flavopiridol, Selleck S1230 DMSO 10000
(Ph 2) HMR-1275
Pimasertib MEK1/2 inhibitor B. Kinase inhibitor Investigational MSC1936369B Selleck S1475 DMSO 10000
(Ph 2)
Motesanib VEGFR, PDGFR, Ret, B. Kinase inhibitor Investigational Selleck S1032 DMSO 10000
Kit inhibitor (Ph 2)
PF-04691502 PI3K/mTOR inhibitor B. Kinase inhibitor Investigational ChemieTek CT-PF1502 DMSO 10000
(Ph 2)
MK1775 Wee1 inhibitor B. Kinase inhibitor Investigational Axon Medchem Axon 1494 DMSO 10000
(Ph 2)
BMS-754807 IGF1R inhibitor B. Kinase inhibitor Investigational ChemieTek CT-BMS75 DMSO 10000
(Ph 2)
OSI-027 mTOR inhibitor B. Kinase inhibitor Investigational ChemieTek CT-O027 DMSO 10000
(Ph 2)
Refametinib MEK1/2 inhibitor B. Kinase inhibitor Investigational RDEA119 ChemieTek CT-R119 DMSO 10000
(Ph 2)
MK-2206 AKT inhibitor B. Kinase inhibitor Investigational ChemieTek CT-MK2206 DMSO 1000
(Ph 2)
Linsitinib IGF1R, IR inhibitor B. Kinase inhibitor Investigational ASP7487 ChemieTek CT-O906 DMSO 10000
(Ph 2)
Tandutinib FLT3, PDGFR, KIT B. Kinase inhibitor Investigational CT53518 LC Laboratories T-7802 DMSO 1000
inhibitor (Ph 2)
Pictilisib PI3K inhibitor, pan-class B. Kinase inhibitor Investigational RG-7321 LC Laboratories G-9252 DMSO 10000
I (Ph 2)
Seliciclib CDK2/7/9 inhibitor B. Kinase inhibitor Investigational Roscovitine LC Laboratories R-1234 DMSO 10000
(Ph 2)
Dactolisib mTOR/(PI3K) inhibitor B. Kinase inhibitor Investigational LC Laboratories N-4288 DMSO 1000
(Ph 2)
Quizartinib FLT3 inhibitor B. Kinase inhibitor Investigational ChemieTek CT-AC220 DMSO 1000
(Ph 2)
Gandotinib JAK2 inhibitor B. Kinase inhibitor Investigational Selleck S2179 DMSO 10000
(Ph 2)
Sotrastaurin PKC inhibitor B. Kinase inhibitor Investigational AEB071 Axon Medchem Axon 1635-2 DMSO 10000
(Ph 2)
UCN-01 PKCbeta, PDK1, Chk, B. Kinase inhibitor Investigational 7-Hydroxy Sigma-Aldrich U6508-4 DMSO 10000
Cdk2 inhibitor (Ph 2) staurosporine
Tivantinib MET inhibitor B. Kinase inhibitor Investigational ChemieTek CT-ARQ197 DMSO 1000
(Ph 2)
RAF265 C-Raf inhibitor B. Kinase inhibitor Investigational CHIR-265 Selleck S2161 DMSO 1000
(Ph 2)
Rabusertib Chk1 inhibitor B. Kinase inhibitor Investigational IC-83 Selleck S2626 DMSO 1000
(Ph 2)
Galunisertib TGF-B/Smad inhibitor B. Kinase inhibitor Investigational Selleck S2230 DMSO 1000
(Ph 2)
Buparlisib PI3K inhibitor, pan-class B. Kinase inhibitor Investigational BKM-120 Selleck S2247 DMSO 10000
I (Ph 2)
Apitolisib PI3K/mTOR inhibitor B. Kinase inhibitor Investigational ChemieTek CT-G0980 DMSO 10000
(Ph 2)
AZD4547 FGFR inhibitor B. Kinase inhibitor Investigational ChemieTek CT-A4547 DMSO 1000
(Ph 2)
Sonolisib PI3K inhibitor, pan-class B. Kinase inhibitor Investigational DJM-166 Active Biochem PX-866 DMSO 10000
I. Irreversible (Ph 2)
(continued)
377
Table 1
378
(continued)
High phase High

Binimetinib MEK1/2 inhibitor B. Kinase inhibitor Investigational MEK162, ARRY- ChemieTek CT-A162 DMSO 1000
(Ph 2) 438162, ARRY-162
Liye He et al.
KX2-391 non-ATP competitive Src B. Kinase inhibitor Investigational Selleck S2700 DMSO 10000
inhibitor (Ph 2)
Fostamatinib Syk inhibitor B. Kinase inhibitor Investigational Selleck S2206-3 AQ 2500
(Ph 2)
Momelotinib JAK1 & 2 inhibitor B. Kinase inhibitor Investigational CYT1138 ChemieTek CT-CYT387 DMSO 10000
(Ph 2)
Ralimetinib p38MAPK inhibitor B. Kinase inhibitor Investigational CP868569 Selleck S1494 DMSO 10000
(Ph 2)
Crenolanib PDGFRA and PDGFRB B. Kinase inhibitor Investigational Selleck S2730 DMSO 10000
inhibitor (Ph 2)
GDC-0068 AKT inhibitor B. Kinase inhibitor Investigational RG7440 ChemieTek CT-G0068 DMSO 10000
(Ph 2)
Alpelisib PI3Kalpha inhibitor B. Kinase inhibitor Investigational BYL719 Selleck S2814 DMSO 2500
(Ph 2)
Baricitinib JAK inhibitor B. Kinase inhibitor Investigational INCB28050 Selleck S2851 DMSO 2500
(Ph 2)
AZD-5363 AKT inhibitor B. Kinase inhibitor Investigational ChemieTek CT-A5363 DMSO 10000
(Ph 2)
SAR302503 JAK2-selective inhibitor B. Kinase inhibitor Investigational TG-101348 ChemieTek CT-TG101 DMSO 10000
(Ph 2)
Bafetinib Abl, Lyn inhibitor B. Kinase inhibitor Investigational NS-187 Selleck S1369 DMSO 1000
(Ph 2)
Tideglusib GSK3 inhibitor B. Kinase inhibitor Investigational Selleck S2823 DMSO 3000
(Ph 2)
Rigosertib Ras-Raf interaction B. Kinase inhibitor Investigational Estybon, Selleck S1362 DMSO 10000
inhibitor (Ph 2) Novonex
Milciclib CDK2 inhibitor B. Kinase inhibitor Investigational Selleck S2751 DMSO 10000
(Ph 2)
Duvelisib PI3K inhibitor B. Kinase inhibitor Investigational INK1197 Selleck S7028 DMSO 500
(Ph 2)
Icotinib EGFR inhibitor B. Kinase inhibitor Investigational Selleck S2922 DMSO 10000
(Ph 2)
Amuvatinib Broad spectrum TK inhib B. Kinase inhibitor Investigational Selleck S1244 DMSO 10000
(Ph 2)
Pelitinib EGFR inhibitor B. Kinase inhibitor Investigational WAY-EKB 569 Selleck S1392 DMSO 2500
(Ph 2)
Telatinib VEGFR, KIT, PDGFR B. Kinase inhibitor Investigational Selleck S2231 DMSO 10000
inhibitor (Ph 2)
Triciribine AKT inhibitor B. Kinase inhibitor Investigational Pentaazacentopthylene, Selleck S1117-2 DMSO 100000
(Ph 2) Tricyclic nucleoside
Tozasertib pan-Aurora inhibitor B. Kinase inhibitor Investigational MK-0457 Selleck S1048 DMSO 10000
(Ph 2)
Varlitinib EGFR HER2 inhibitor B. Kinase inhibitor Investigational Selleck S2755 DMSO 10000
(Ph 2)
Golvatinib MET, VEGFR2 inhibitor B. Kinase inhibitor Investigational Selleck S2859 DMSO 2500
(Ph 2)
Copanlisib PI3K alpha, beta selective B. Kinase inhibitor Investigational Selleck S2802-2 DMSO 1000
inhibitor (Ph 2) w/
10mM
TFA
Sapitinib Pan-HER inhibitor B. Kinase inhibitor Investigational Selleck S2192 DMSO 1000
(Ph 2)
NVP-AEE788 EGFR, VEGFR, ABL, B. Kinase inhibitor Investigational AEE788, GNF-PF- Selleck S1486 DMSO 2500
SRC inhibitor (Ph 2) 5343
NVP-BGT226 PI3K/mTOR inhibitor B. Kinase inhibitor Investigational BGT226 Selleck S2749 DMSO 1000
(Ph 2)
BMS-777607 Met, Axl, Ron and Tyro3 B. Kinase inhibitor Investigational Selleck S1561 DMSO 2500
inhibitor (Ph 2)
Abemaciclib CDK4 and 6 inhibitor B. Kinase inhibitor Investigational Selleck S7158 DMSO 2500
(Ph 2)
VX 745 p38MAPK inhibitor B. Kinase inhibitor Investigational Tocris 3915 DMSO 10000
(Ph 2) Biosciences
XL-647 EGFR, ERBB2, VEGFR, B. Kinase inhibitor Investigational Santa Cruz sc-364659 DMSO 1000
EPHB4 (Ph 2) Biotechnology
PD184352 MEK1/2 inhibitor B. Kinase inhibitor Investigational Selleck S1020 DMSO 10000
(Ph 2)
ENMD-2076 pan-Aurora inhibitor B. Kinase inhibitor Investigational Selleck S1181 DMSO 10000
(Ph 2)
MK-2461 MET inhibitor B. Kinase inhibitor Investigational Selleck S2774 DMSO 10000
(Ph 2)
PD0325901 MEK1/2 inhibitor B. Kinase inhibitor Investigational Selleck S1036 DMSO 1000
(Ph 2)
PH-797804 p38MAPK inhibitor B. Kinase inhibitor Investigational Selleck S2726 DMSO 1000
(Ph 2)
TAK-715 p38MAPK inhibitor B. Kinase inhibitor Investigational Selleck S2928 DMSO 10000
(Ph 2)
TG100-115 PI3K gamma/delta B. Kinase inhibitor Investigational Selleck S1352 DMSO 10000
inhibitor (Ph 2)
CEP-32496 BRAF inhibitor B. Kinase inhibitor Investigational Selleck S8015 DMSO 10000
(Ph 2)
GDC-0623 MEK1/2 inhibitor B. Kinase inhibitor Investigational Active Biochem A-1181 DMSO 2500
(Ph 2)
Talmapimod p38MAPK alpha selective B. Kinase inhibitor Investigational Axon Medchem Axon 1671 DMSO 10000
inhibitor (Ph 2)
Encorafenib B-RAF(V600E) B. Kinase inhibitor Investigational LGX818 Selleck S7108 DMSO 1000
(Ph 2)
Tanzisertib JNK1, 2, 3 inhibitor B. Kinase inhibitor Investigational Medchemexpress HY-15495 DMSO 10000
(Ph 2)
BMS863233 Cdc7 inhibitor B. Kinase inhibitor Investigational Selleck S7547 AQ 10000
(Ph 2)
Entospletinib SYK inhibitor B. Kinase inhibitor Investigational Selleck S7523 DMSO 5000
(Ph 2)
Voxtalisib mTOR/PI3K inhibitor B. Kinase inhibitor Investigational XL765 ChemieTek CT-XL765c DMSO 10000
(Ph 2)
Pilaralisib PI3K inhibitor. Pan-class B. Kinase inhibitor Investigational XL147 Medchemexpress HY-16526 DMSO 2500
I (Ph 2)
Uprosertib AKT inhibitor B. Kinase inhibitor Investigational Medchemexpress HY-15965 DMSO 10000
(Ph 2)
Filgotinib JAK1-selective inhibitor B. Kinase inhibitor Investigational Selleck S7605 DMSO 10000
(Ph 2)
Afuresertib AKT1-selective inhibitor B. Kinase inhibitor Investigational Medchemexpress HY-15966A DMSO 1000
(Ph 2)
PF-06463922 ALK, ROS1 inhibitor B. Kinase inhibitor Investigational Selleck S7536 DMSO 1000
(Ph 2)
SLx-2119 ROCK2 inhibitor B. Kinase inhibitor Investigational KD025 Medchemexpress HY-15307 DMSO 5000
(Ph 2)
Poziotinib pan-HER inhibitor B. Kinase inhibitor Investigational NOV120101 Medchemexpress HY-15730 DMSO 1000
(Ph 2)
Spebrutinib BTK inhibitor B. Kinase inhibitor Investigational AVL-292 Medchemexpress HY-18012 DMSO 1000
(Ph 2)
Ulixertinib ERK inhibitor B. Kinase inhibitor Investigational VRT752271 ChemieTek CT-VRT752 DMSO 10000
(Ph 2)
Prexasertib Chk1 inhibitor B. Kinase inhibitor Investigational Medchemexpress HY-18174A DMSO 10000
(Ph 2)
Vatalanib VEGFR-1 & -2 inhibitor B. Kinase inhibitor Investigational PTK 787, ZK222584 LC Laboratories V-8303 DMSO 10000
(Ph 3)
Orantinib KDR, FGFR, PDGFR B. Kinase inhibitor Investigational SU6668 Selleck S1470 DMSO 10000
inhibitor (Ph 3)
Selumetinib MEK1/2 inhibitor B. Kinase inhibitor Investigational ARRY-142886 Selleck S1008 DMSO 10000
(Ph 3)
Dovitinib FGFR inhibitor B. Kinase inhibitor Investigational TKI258 Selleck S1018 DMSO 10000
(Ph 3)
Perifosine AKT/PI3K inhibitor B. Kinase inhibitor Investigational Selleck S1037 AQ 2500
(Ph 3)
Cediranib KDR/Flt/VEGFR B. Kinase inhibitor Investigational Recentin Selleck S1017 DMSO 1000
inhibitor (Ph 3)
Tivozanib VEGFR1, 2, 3, c-Kit, B. Kinase inhibitor Investigational KRN951 ChemieTek CT-AV951 DMSO 10000
PDGFRB inhibitor (Ph 3)

AZD1152-HQPA Aurora B inhibitor B. Kinase inhibitor Investigational ChemieTek CT-A1152H DMSO 1000
(Ph 3)
Alisertib Aurora A inhibitor B. Kinase inhibitor Investigational ChemieTek CT-M8237 DMSO 10000
(Ph 3)
Lestaurtinib FLT3, JAK2, TrkA, TrkB, B. Kinase inhibitor Investigational LC Laboratories L-6307 DMSO 1000
TrkC inhibitor (Ph 3)
Saracatinib Src, Abl inhibitor B. Kinase inhibitor Investigational LC Laboratories S-8906 DMSO 10000
(Ph 3)
Canertinib pan-HER inhibitor B. Kinase inhibitor Investigational PD 183805 LC Laboratories C-1201 DMSO 10000
(Ph 3)
Enzastaurin PKCbeta inhibitor B. Kinase inhibitor Investigational LC Laboratories E-4506 DMSO 10000
(Ph 3)
Masitinib KIT inhibitor B. Kinase inhibitor Investigational LC Laboratories M-7007 DMSO 10000
(Ph 3)
Midostaurin PKC, PKA, S6K and B. Kinase inhibitor Investigational PKC412, CGP 41251 LC Laboratories P-7600 DMSO 10000
EGFR inhibitor (Ph 3)
Ruboxistaurin PKCbeta inhibitor B. Kinase inhibitor Investigational Axon Medchem Axon 1401-2 DMSO 10000
(Ph 3)
Volasertib PLK1 inhibitor B. Kinase inhibitor Investigational ChemieTek CT-BI6727 DMSO 1000
(Ph 3)
Neratinib EGFR inhibitor B. Kinase inhibitor Investigational Selleck S2150(2) DMSO 1000
(Ph 3)
Linifanib VEGFR, PDGFR, B. Kinase inhibitor Investigational AL-39324, RG3635 Selleck S1003 DMSO 1000
CSF-1R, FLT3 (Ph 3)
inhibitor
Brivanib VEGFR inhibitor B. Kinase inhibitor Investigational BMS-582664 Selleck S1084 DMSO 1000
(Ph 3)
Dacomitinib pan-HER inhibitor B. Kinase inhibitor Investigational ChemieTek CT-DACO DMSO 1000
(Ph 3)
Dinaciclib CDK inhibitor B. Kinase inhibitor Investigational ChemieTek CT-DINA DMSO 1000
(Ph 3)
Apatinib VEGFR inhibitor B. Kinase inhibitor Investigational Selleck S2221 DMSO 10000
(Ph 3)
Semaxanib VEGFR inhibitor B. Kinase inhibitor Investigational Selleck S2845 DMSO 10000
(Ph 3)
NVP-LEE011 CDK4/6 inhibitor B. Kinase inhibitor Investigational LEE011 Selleck S7440 DMSO 10000
(Ph 3)
Pacritinib FLT3/JAK2 B. Kinase inhibitor Investigational Selleck S8057 DMSO 10000
(Ph 3)
Cobimetinib MEK1/2 inhibitor B. Kinase inhibitor Investigational XL-518 Medchemexpress HY-13064 DMSO 1000
(Ph 3)
Osimertinib EGFR(L858R/T790M) B. Kinase inhibitor Investigational Selleck S7297 DMSO 2500
inhibitor (Ph 3)
Losmapimod p38MAPK inhibitor B. Kinase inhibitor Investigational Selleck S7215 DMSO 10000
(Ph 3)
Rociletinib EGFR(L858R/T790M) B. Kinase inhibitor Investigational AVL-301, CNX-419 ChemieTek CT-CO1686 DMSO 10000
inhibitor (Ph 3)
Taselisib PI3K alpha, gamma B. Kinase inhibitor Investigational RG7604 Medchemexpress HY-13898-2 DMSO 1000
selective inhibitor (Ph 3)
Pexidartinib KIT, CSF1R, FLT3 B. Kinase inhibitor Investigational Medchemexpress HY-16749 DMSO 10000
inhibitor (Ph 3)
AZ 3146 Mps1 kinase (TTK) B. Kinase inhibitor Probe Tocris 3994 DMSO 10000
inhibitor Biosciences
PF-04708671 p70S6K inhibitor B. Kinase inhibitor Probe Sigma-Aldrich PZ0143 DMSO 10000
TGX-221 PI3K beta selective B. Kinase inhibitor Probe ChemieTek CT-TGX221 DMSO 10000
inhibitor
VX-11E ERK1 & 2 inhibitor B. Kinase inhibitor Probe ChemieTek CT-VX11e DMSO 2500
PF-4800567 CK1epsilon inhibitor B. Kinase inhibitor Probe Tocris 4281 DMSO 10000
Biosciences
PF-670462 CK1epsilon and B. Kinase inhibitor Probe Tocris 3316 DMSO 10000
CK1delta inhibitor Biosciences
(5Z)-7-Oxozeaenol TAK1 inhibitor B. Kinase inhibitor Probe (5Z)-7-Oxozeaenol Tocris 3604-03-01 DMSO 10000
Biosciences
GSK269962 ROCK1 and ROCK2 B. Kinase inhibitor Probe Tocris 4009 DMSO 10000
PF 431396 FAK/PYK2 inhibitor B. Kinase inhibitor Probe Tocris 4278 DMSO 10000
Biosciences
GSK650394 SGK1 & 2 inhibitor B. Kinase inhibitor Probe Tocris 3572 DMSO 10000
Biosciences
AZ-23 Trk inhibitor B. Kinase inhibitor Probe Axon Medchem Axon 1610 DMSO 1000
GSK-1838705A IGF1R, INSR, ALK inhib B. Kinase inhibitor Probe ChemieTek CT-GSK183 DMSO 2500
GSK-1904529A IGF1R, INSR inhib B. Kinase inhibitor Probe ChemieTek CT-GSK190 DMSO 10000
SP600125 pan-JNK inhibitor B. Kinase inhibitor Probe Pyrazolanthrone, LC Laboratories S-7979 DMSO 100000
Anthrapyrazolone
BX-912 PDK1 inhib B. Kinase inhibitor Probe Selleck S1275 DMSO 10000
GSK-2334470 PDK1 inhibitor B. Kinase inhibitor Probe ChemieTek CT-GSK233 DMSO 10000
SCH772984 ERK1 & 2 inhibitor B. Kinase inhibitor Probe ChemieTek CT-SCH772 DMSO 10000
PKI-402 PI3K/mTOR inhibitor B. Kinase inhibitor Probe Selleck S2739 DMSO 2500
PHA 408 IKK-2 inhibitor B. Kinase inhibitor Probe Axon Medchem Axon 1651 DMSO 10000
PS-1145 IKK-2 inhibitor B. Kinase inhibitor Probe Axon Medchem Axon 1568 DMSO 25000
KU-60019 ATM inhibitor B. Kinase inhibitor Probe Selleck S1570 DMSO 25000
TPCA-1 IKK-2 inhibitor B. Kinase inhibitor Probe Selleck S2824 DMSO 25000
IRAK1/4 inhibitor IRAK1/4 inhibitor B. Kinase inhibitor Probe IRAK1/4 inhibitor Merck Millipore 407601 DMSO 25000
MK-8745 Aurora A inhibitor B. Kinase inhibitor Probe Selleck S7065 DMSO 2500
AZ191 DYRK1A inhibitor B. Kinase inhibitor Probe Selleck S7338 DMSO 10000
VE-821 ATR inhibitor B. Kinase inhibitor Probe Selleck S8007 DMSO 10000
AZD7545 PDHK inhibitor B. Kinase inhibitor Probe Selleck S7517 DMSO 10000
GNE-0877 LRRK2 inhibitor B. Kinase inhibitor Probe Selleck S7367 DMSO 1000
OTSSP167 MELK inhibitor B. Kinase inhibitor Probe Selleck S7159 DMSO 1000
UNC2881 MER inhibitor B. Kinase inhibitor Probe Selleck S7325 DMSO 2500
AMG-925 FLT-3, CDK4 inhibitor B. Kinase inhibitor Probe ChemieTek CT-AMG925 DMSO 1000
FRAX486 PAK1, 2, 3 inhibitor B. Kinase inhibitor Probe ChemieTek CT-F486 DMSO 5000
GNE-7915 LRRK2 inhibitor B. Kinase inhibitor Probe ChemieTek CT-GNE79 DMSO 1000
TAK-632 pan-RAF inhibitor B. Kinase inhibitor Probe ChemieTek CT-TAK632 DMSO 10000
GSK2656157 PERK inhibitor B. Kinase inhibitor Probe Medchemexpress HY-13820 DMSO 2500
Tacrolimus Binds FKBP12, causes C. Rapalog Approved Fujimycin Prograf, Tocris 3631 DMSO 10000
inhibition of Advagraf, Biosciences
calcineurin Protopic
Everolimus binds FKBP12, causes C. Rapalog Approved SDZ-RAD Afinitor, LC Laboratories E-4040 DMSO 100
inhibition of Certican,
mTORC1 Zortress
Temsirolimus binds FKBP12, causes C. Rapalog Approved Torisel LC Laboratories T-8040 DMSO 100
inhibition of
mTORC1
Sirolimus binds FKBP12, causes C. Rapalog Approved Rapamycin Rapamune LC Laboratories R-5000 DMSO 100
inhibition of
mTORC1
Ridaforolimus binds FKBP12, causes C. Rapalog Investigational AP 23573, Deforolimus Active Biochem A-1004 DMSO 100
inhibition of (Ph 3)
mTORC1
Dexamethasone Immunosuppressant; D. Immunomodulatory Approved Decadron, Selleck S1322 DMSO 10000
glucocorticoid Dexpak
Thalidomide Immunosuppressant D. Immunomodulatory Approved National Cancer NSC 66847- DMSO 10000
Institute R/5
Imiquimod Immunomodulatory D. Immunomodulatory Approved National Cancer NSC 369100- DMSO 2500
agent, TLR7 agonist Institute F/4
Levamisole Immunomodulatory D. Immunomodulatory Approved Tetramisole Sigma-Aldrich L9756 DMSO 10000
agent
Methylprednisolone Immunosuppressant D. Immunomodulatory Approved Santa Cruz sc-205749 DMSO 10000
Biotechnology
Prednisolone Immunomodulatory D. Immunomodulatory Approved Santa Cruz sc-205815 DMSO 10000
agent Biotechnology
Prednisone Immunomodulatory D. Immunomodulatory Approved Santa Cruz sc-205816 DMSO 10000
agent Biotechnology
Bimatoprost Prostaglandin analog D. Immunomodulatory Approved Latisse, Selleck S1407 DMSO 5500
Lumigan,
Prostamide
Lenalidomide Immunomodulatory D. Immunomodulatory Approved Revlimid LC Laboratories L-5499 DMSO 100000
Pomalidomide Immunomodulatory D. Immunomodulatory Approved 3-amino-thalidomide Pomalyst Sigma-Aldrich P0018 DMSO 10000
agent, anti-angiogenic
VGX-1027 Immunomodulator D. Immunomodulatory Investigational Selleck S7515 DMSO 10000
(Ph 1)
NLG919 IDO inhibitor D. Immunomodulatory Investigational Selleck S7111 DMSO 10000
(Ph 1)
Epacadostat IDO inhibitor D. Immunomodulatory Investigational Medchemexpress HY-15689 DMSO 10000
(Ph 2)
Tasquinimod S100A9, D. Immunomodulatory Investigational Medchemexpress HY-10528 DMSO 10000
immunomodulatory, (Ph 3)
anti-angiogenic
Tretinoin Retinoic acid receptor E. Differentiating/ Approved ATRA, retinoic acid, National Cancer NSC 122758- DMSO 10000
agonist epigenetic modifier vitamin A Institute Q/6
Bexarotene Antineoplastic agent; E. Differentiating/ Approved Santa Cruz sc-217753 DMSO 10000
retinoid specifically epigenetic modifier Biotechnology
selective for retinoid X
receptors
Decitabine Nucleoside analog DNA E. Differentiating/ Approved 5-aza-2’-deoxycytidine Selleck S1200 DMSO 10000
methyl transferase epigenetic modifier
inhibitor
Vorinostat HDAC inhibitor E. Differentiating/ Approved SAHA Zolinza LC Laboratories V-8477 DMSO 10000
epigenetic modifier
Olaparib PARP inhibitor E. Differentiating/ Approved KU-0059436 Lynparza LC Laboratories O-9201 DMSO 10000
epigenetic modifier
Arsenic(III) oxide Thioredoxin reductase E. Differentiating/ Approved Sigma-Aldrich 202673 AQ 2500
inhibitor; cytotoxic epigenetic modifier
chemotherapeutic
Azacitidine Nucleoside analog DNA E. Differentiating/ Approved 5-azacytidine, 5-AzaC Vidaza National Cancer NSC 102816- DMSO 10000
methyl transferase epigenetic modifier Institute P/21
inhibitor
Valproic acid HDAC inhibitor E. Differentiating/ Approved Sigma-Aldrich P4543 AQ 1E+06
epigenetic modifier
Romidepsin HDAC inhibitor E. Differentiating/ Approved FR901228 Istodax Medchemexpress HY-15149 DMSO 1000
epigenetic modifier
Belinostat HDAC inhibitor E. Differentiating/ Approved (US) PX105684 ChemieTek CT-BELI DMSO 10000
epigenetic modifier
Panobinostat HDAC inhibitor E. Differentiating/ Approved (US) Farydak, LC Laboratories P-3703 DMSO 1000
epigenetic modifier Faridak
Niraparib PARP inhibitor E. Differentiating/ Investigational ChemieTek CT-MK4827 DMSO 10000
epigenetic modifier (Ph 1)
Rocilinostat HDAC-6 selective E. Differentiating/ Investigational ChemieTek CT-ACY12 DMSO 10000
inhibitor epigenetic modifier (Ph 1)
AR-42 HDAC inhibitor E. Differentiating/ Investigational OSU-HDAC42 Selleck S2244 DMSO 10000
E7438 EZH2 inhibitor E. Differentiating/ Investigational EPZ-6438 ChemieTek CT-EPZ438 DMSO 10000
EPZ-5676 DOT1L inhibitor E. Differentiating/ Investigational Selleck S7062 DMSO 1000
OTX015 BET family inhibitor E. Differentiating/ Investigational Selleck S7360 DMSO 10000
CUDC-907 HDAC1/2/3/10, E. Differentiating/ Investigational Selleck S2759 DMSO 10000
PI3Kalpha inhibitor epigenetic modifier (Ph 1)
GSK525762 BET family inhibitor E. Differentiating/ Investigational I-BET762 ChemieTek CT-BET762 DMSO 10000
BAY 87-2243 HIF1alpha inhibitor E. Differentiating/ Investigational Selleck S7309 DMSO 1000
GSK2879552 LSD1 inhibitor E. Differentiating/ Investigational ChemieTek CT-GSK287 DMSO 10000
Rucaparib PARP inhibitor E. Differentiating/ Investigational AG-014447, Axon Medchem Axon 1529 DMSO 10000
epigenetic modifier (Ph 2) PF-01367338
Entinostat HDAC inhibitor E. Differentiating/ Investigational SNDX-275 LC Laboratories E-3866 DMSO 10000
Mocetinostat HDAC inhibitor E. Differentiating/ Investigational Selleck S1122 DMSO 10000
(HDAC1 & epigenetic modifier (Ph 2)
2-selective)
Quisinostat HDAC inhibitor E. Differentiating/ Investigational Active Biochem A-1162 DMSO 1000
FG-4592 HIF prolyl hydroxylase E. Differentiating/ Investigational Selleck S1007 DMSO 10000
Lomeguatrib O6-methylguanine-DNA E. Differentiating/ Investigational Tocris 4359 DMSO 10000
methyltransferase epigenetic modifier (Ph 2) Biosciences
inhibitor
Pracinostat HDAC inhibitor E. Differentiating/ Investigational Selleck S1515 DMSO 10000
Resminostat HDAC1, 3, 6 inhibitor E. Differentiating/ Investigational Medchemexpress HY-14718A DMSO 10000
Givinostat HDAC inhibitor E. Differentiating/ Investigational Selleck S2170 DMSO 1000
Abexinostat HDAC1-selective E. Differentiating/ Investigational Medchemexpress HY-10990 DMSO 10000
Veliparib PARP inhibitor E. Differentiating/ Investigational Selleck S1004 DMSO 10000
Tipifarnib Farnesyltransferase E. Differentiating/ Investigational Zarnestra, IND 58359 Selleck S1453 DMSO 10000
Tacedinaline HDAC inhibitor E. Differentiating/ Investigational Acetyldinaline, Gö LC Laboratories C-2606 DMSO 1000
epigenetic modifier (Ph 3) 5549, PD 123654,
Iniparib PARP inhibitor E. Differentiating/ Investigational IND-71677 Axon Medchem Axon 1566 DMSO 10000
Lonafarnib Farnesyl transferase E. Differentiating/ Investigational Sarasar Selleck S2797 DMSO 10000
Talazoparib PARP1/2 inhibitor E. Differentiating/ Investigational Medchemexpress HY-16106 DMSO 1000
XAV-939 Tankyrase-1 and -2 E. Differentiating/ Probe Selleck S1180 DMSO 10000
epigenetic modifier
(+)JQ1 BET family inhibitor E. Differentiating/ Probe (+)JQ1, SGCBD01(+) SGC SGCBD01(+) DMSO 10000
epigenetic modifier
Tubacin HDAC6 inhibitor E. Differentiating/ Probe Tubacin Selleck S2239 DMSO 10000
epigenetic modifier
Tubastatin A HDAC6 inhibitor E. Differentiating/ Probe Tubastatin A ChemieTek CT-TUBA DMSO 10000
epigenetic modifier
StemRegenin 1 AHR antagonist, stem cell E. Differentiating/ Probe StemRegenin 1, SR1 ChemieTek CT-SR1 DMSO 10000
regenerating epigenetic modifier
PFI-1 Selective chemical probe E. Differentiating/ Probe SGC SGCPFI DMSO 10000
for BET epigenetic modifier
Bromodomains
I-BET151 BET family inhibitor E. Differentiating/ Probe GSK1210151A ChemieTek CT-BET151 DMSO 10000
epigenetic modifier
IOX-2 PHD2 inhibitor E. Differentiating/ Probe Tocris 4451 DMSO 50000
epigenetic modifier Biosciences
GSK-J4 JMJD3 (histone E. Differentiating/ Probe Selleck S7070-2 DMSO 100000
demethylase) inhibitor epigenetic modifier
UNC1215 L3MBTL3 inhibitor E. Differentiating/ Probe Tocris 4666 DMSO 30000
SGC0946 DOT1L inhibitor E. Differentiating/ Probe Selleck S7079 DMSO 10000
epigenetic modifier
UNC0642 G9a/GLP inhibitor E. Differentiating/ Probe Tocris 5132 DMSO 10000
GSK343 EZH2 inhibitor E. Differentiating/ Probe SGC SGCGSK343 DMSO 10000
epigenetic modifier
UNC0638 G9a/GLP inhibitor E. Differentiating/ Probe Tocris 4343 DMSO 10000
C646 p300/CREB-binding E. Differentiating/ Probe Axon Medchem Axon 1781 DMSO 25000
protein (CBP) epigenetic modifier
inhibitor
EPZ-5687 EZH2 inhibitor E. Differentiating/ Probe ChemieTek CT-EPZ687 DMSO 10000
epigenetic modifier
IOX-1 2-Oxoglutarate E. Differentiating/ Probe Selleck S7234 DMSO 100000
Oxygenase Inhibitor epigenetic modifier
GSK2801 BAZ2B/A bromodomain E. Differentiating/ Probe SGC SGCGSK2801 DMSO 10000
inhibitor epigenetic modifier
SGC-CBP30 CREBBP/EP300 E. Differentiating/ Probe Medchemexpress HY-15826 DMSO 25000
bromodomain epigenetic modifier
inhibitor
RGFP966 HDAC3 inhibitor E. Differentiating/ Probe Selleck S7229 DMSO 10000
epigenetic modifier
PTC-209 BMI-1 inhibitor E. Differentiating/ Probe Selleck S7539 DMSO 10000
epigenetic modifier
UM729 Enhancer of aryl E. Differentiating/ Probe Medchemexpress HY-15972 DMSO 10000
hydrocarbon receptor epigenetic modifier
(AhR) antagonists
PCI-34051 HDAC8 inhibitor E. Differentiating/ Probe Medchemexpress HY-15224 DMSO 10000
epigenetic modifier
EPZ015666 PRMT5 inhibitor E. Differentiating/ Probe Selleck S7748 DMSO 10000
epigenetic modifier
Goserelin Gonadotropin releasing F. Hormone therapy Approved Tocris 3592 DMSO 10000
hormone superagonist Biosciences
Raloxifene Selective estrogen F. Hormone therapy Approved National Cancer NSC 747974- DMSO 10000
receptor modulator Institute X/1
Letrozole Aromatase inhibitor F. Hormone therapy Approved Femara National Cancer NSC 719345- DMSO 10000
Institute G/2
Anastrozole Aromatase inhibitor F. Hormone therapy Approved National Cancer NSC 719344- DMSO 10000
Institute F/2
Bicalutamide Nonsteroidal F. Hormone therapy Approved Casodex, ChemieTek CT-BIC DMSO 10000
antiandrogen Cosudex,
Calutide,
Kalumid
Aminoglutethimide Anti-steroid, aromatase F. Hormone therapy Approved Sigma-Aldrich A9657 DMSO 10000
inhibitor
Clomifene Selective estrogen F. Hormone therapy Approved Clomid, Serophene, Selleck S2561 DMSO 10000
receptor modulator Milophene
Finasteride type II 5-alpha reductase F. Hormone therapy Approved Tocris 3293 DMSO 10000
Flutamide Nonsteroidal F. Hormone therapy Approved Tocris 4094 DMSO 10000
antiandrogen Biosciences
Fulvestrant Estrogen receptor F. Hormone therapy Approved Selleck S1191 DMSO 1000
antagonist
Megestrol Progestogen F. Hormone therapy Approved Megace National Cancer NSC 71423- DMSO 10000
Institute Q/12
Tamoxifen Estrogen receptor F. Hormone therapy Approved National Cancer NSC 180973- DMSO 10000
antagonist Institute S/203
Nilutamide Nonsteroidal F. Hormone therapy Approved Santa Cruz sc-203644 DMSO 10000
antiandrogen Biotechnology
Exemestane Aromatase inhibitor F. Hormone therapy Approved National Cancer NSC 713563- DMSO 10000
Institute U/2
Abiraterone P450 17alpha- F. Hormone therapy Approved Selleck S1123 DMSO 5000
hydroxylase-17,20-
lyase inhibitor
Toremifene selective estrogen F. Hormone therapy Approved Fareston, Santa Cruz sc-253712 DMSO 10000
receptor modulator Acapodene Biotechnology
Lasofoxifene Selective estrogen F. Hormone therapy Approved Oporia Santa Cruz sc-211721 DMSO 1000
receptor modulator Biotechnology
Enzalutamide AR antagonist F. Hormone therapy Approved Xtandi Axon Medchem Axon 1613 DMSO 10000
ARN 509 AR antagonist F. Hormone therapy Investigational Axon Medchem Axon 1979 DMSO 10000
(Ph 2)
Orteronel CYP17A1, androgen F. Hormone therapy Investigational Selleck S1195 DMSO 10000
synth inhib. (Ph 3)
4-hydroxy- Selective estrogen F. Hormone therapy Investigational Santa Cruz sc-3542 DMSO 10000
tamoxifen receptor modulator as a gel Biotechnology
preparation
RD162 AR antagonist F. Hormone therapy Probe Axon Medchem Axon 1532 DMSO 10000
Serdemetan HDM2-p53 antagonist G. Apoptotic Investigational Selleck S1172 DMSO 10000
modulator (Ph 1)
APR-246 p53 activator, thioredoxin G. Apoptotic Investigational Prima-1 Met Tocris 3710 DMSO 10000
reductase 1 inhibitor modulator (Ph 1) Biosciences
PAC-1 procaspase-3 activator G. Apoptotic Investigational Selleck S2738 DMSO 10000
modulator (Ph 1)
AT-406 XIAP, cIAP1, cIAP2 G. Apoptotic Investigational Selleck S2754 DMSO 10000
inhibitor modulator (Ph 1)
Venetoclax Bcl-2-selective inhibitor G. Apoptotic Investigational GDC-0199 ChemieTek CT-A199 DMSO 1000
modulator (Ph 1)
Verdinexor XPO1/CRM1 inhibitor G. Apoptotic Investigational Medchemexpress HY-15970 DMSO 1000
modulator (Ph 1)
SAR405838 MDM2 inhibitor G. Apoptotic Investigational MI-773 Selleck S7649 DMSO 10000
modulator (Ph 1)
AT 101 Bcl-2 family inhibitor G. Apoptotic Investigational R-(-).gossypol Selleck S2812 DMSO 100000
modulator (Ph 2)
Navitoclax Bcl-2/Bcl-xL inhibitor G. Apoptotic Investigational Selleck S1001-2 DMSO 10000
modulator (Ph 2)
YM155 Survivin inhibitor G. Apoptotic Investigational ChemieTek CT-YM155 DMSO 10000
modulator (Ph 2)
Birinapant IAPs, SMAC mimetic G. Apoptotic Investigational ChemieTek CT-BIRI DMSO 1000
modulator (Ph 2)
LCL161 IAPs, SMAC mimetic G. Apoptotic Investigational ChemieTek CT-LCL161 DMSO 25000
modulator (Ph 2)
Selinexor CRM1 inhibitor G. Apoptotic Investigational Selleck S7252 DMSO 10000
modulator (Ph 2)
AMG-232 MDM2 inhibitor G. Apoptotic Investigational ChemieTek CT-AMG232 DMSO 1000
modulator (Ph 2)
Obatoclax Bcl-2 family inhibitor G. Apoptotic Investigational Selleck S1057 DMSO 10000
modulator (Ph 3)
Nutlin-3 MDM2 inhibitor G. Apoptotic Probe Nutlin-3 Selleck S1061 DMSO 10000
modulator
RG7388 MDM2 inhibitor G. Apoptotic Probe Medchemexpress HY-15676 DMSO 10000
modulator
WEHI-539 Bcl-XL inhibitor G. Apoptotic Probe Medchemexpress HY-15607A DMSO 2500
modulator
UMI-77 MCL1 inhibitor G. Apoptotic Probe Selleck S7531 DMSO 100000
modulator
Sabutoclax pan-Bcl-2 family inhibitor G. Apoptotic Probe ONT-701 Selleck S8061 DMSO 25000
modulator
Pyridoclax MCL-1 inhibitor G. Apoptotic Probe Medchemexpress HY-12527 DMSO 25000
modulator
A-1210477 MCL-1 inhibitor G. Apoptotic Probe Active Biochem A-9036 DMSO 50000
modulator
Methotrexate Antimetabolite; Anti- H. Metabolic modifier Approved Selleck S1210 DMSO 5000
folate agent
Pemetrexed Dihydrofolate reductase H. Metabolic modifier Approved Alimta LC Laboratories P-7177 AQ 10000
inhibitor
Atorvastatin HMG CoA reductase H. Metabolic modifier Approved Lipitor ChemieTek CT-ATOR DMSO 10000
inhibitor
Lovastatin HMG-CoA reductase H. Metabolic modifier Approved Selleck S2061 DMSO 10000
inhibitor (non-
oncology)
Metformin AMPK activator H. Metabolic modifier Approved Tocris 2864 AQ 100000
(non- Biosciences
oncology)
Simvastatin HMG CoA reductase H. Metabolic modifier Approved Zocor Sigma-Aldrich S6196 DMSO 10000
inhibitor (non-
oncology)
Disulfiram aldehyde dehydrogenase H. Metabolic modifier Approved Antabuse Selleck S1680 DMSO 100000
inhibitor (non-
oncology)
Pravastatin HMG CoA reductase H. Metabolic modifier Approved Eptastatin Tocris 2318 AQ 10000
inhibitor (non- Biosciences
oncology)
Pevonedistat NAE inhibitor H. Metabolic modifier Investigational ChemieTek CT-M4924 DMSO 10000
(Ph 1)
URB597 FAAH inhibitor H. Metabolic modifier Investigational Selleck S2631 DMSO 1000
(Ph 1)
CPI-613 pyruvate dehydrogenase, H. Metabolic modifier Investigational Selleck S2776 DMSO 10000
alpha-ketoglutarate (Ph 2)
dehydrogenase
inhibitor
AVN944 IMPDH inhibitor H. Metabolic modifier Investigational VX-944 ChemieTek CT-AVN944 DMSO 10000
(Ph 2)
Triapine ribonucleotide reductase H. Metabolic modifier Investigational 3-AP Selleck S7470 DMSO 10000
inhibitor (Ph 2)
Daporinad NAMPT inhibitor H. Metabolic modifier Probe DGB Axon Medchem Axon 1546 DMSO 1000
AGI-5198 IDH1 R132H/R132C H. Metabolic modifier Probe IDH-C35, AGI 5198 Selleck S7185 DMSO 10000
inhibitor
TH588 MTH1 inhibitor H. Metabolic modifier Probe KI/Helleday TH588 DMSO 25000
AGI-6780 IDH2-R140Q inhibitor H. Metabolic modifier Probe Medchemexpress HY-15734 DMSO 10000
GSK923295 CENP-E inhibitor I. Kinesin inhibitor Investigational Medchemexpress HY-10299 DMSO 10000
(Ph 1)
SB 743921 Mitotic inhibitor. I. Kinesin inhibitor Investigational Selleck S2182 DMSO 100
Eg5/KSP inhibitor (Ph 2)
ARRY-520 KSP/Eg5 inhibitor I. Kinesin inhibitor Investigational Medchemexpress HY-15187 DMSO 1000
(Ph 2)
Rofecoxib COX-2 inhibitor J. NSAID Approved Vioxx ChemieTek CT-RX001 DMSO 10000
Celecoxib Selective COX-2 inhibitor J. NSAID Approved National Cancer NSC 719627- DMSO 10000
Institute M/1
CUDC-305 HSP90 inhibitor K. HSP inhibitor Investigational DEBIO-0932 ChemieTek CT-CU305 DMSO 10000
(Ph 1)
Tanespimycin HSP90 inhibitor K. HSP inhibitor Investigational 17-AAG Selleck S1141 DMSO 10000
(Ph 2)
Alvespimycin HSP90 inhibitor K. HSP inhibitor Investigational 17-DMAG Selleck S1142 DMSO 1000
(Ph 2)
BIIB021 HSP90 inhibitor K. HSP inhibitor Investigational Selleck S1175 DMSO 10000
(Ph 2)
Luminespib HSP90 inhibitor K. HSP inhibitor Investigational AUY922 ChemieTek CT-AUY922 DMSO 1000
(Ph 2)
Onalespib HSP90 inhibitor K. HSP inhibitor Investigational Medchemexpress HY-14463 DMSO 2500
(Ph 2)
Ganetespib HSP90 inhibitor K. HSP inhibitor Investigational Selleck S1159 DMSO 1000
(Ph 3)
VER 155008 HSP70 inhibitor K. HSP inhibitor Probe Axon Medchem Axon 1608 DMSO 10000
Radicicol HSP90 inhibitor K. HSP inhibitor Probe Monorden Tocris 2/1/1589 DMSO 10000
Biosciences
Pilocarpine Non-selective muscarinic X. Other Approved Salagen Tocris 694 DMSO 40000
receptor agonist Biosciences
Anagrelide PDE-3, PLA2 inhibitor X. Other Approved Agrylin, Tocris 2432 DMSO 10000
Xagrid Biosciences
Mepacrine Unclear. PLA2 inhibitor. X. Other Approved Quinacrine, Achricrine Sigma-Aldrich Q3251 AQ 50000
NF-kB inhibitor, p53
activator
Plerixafor CXCR4 antagonist X. Other Approved JM 3100, AMD 3100 Mozobil Cayman 10011332-2 AQ 10000
Chemical
Company
Fingolimod S1PR antagonist X. Other Approved Gilenya LC Laboratories F-4633 DMSO 10000
Vismodegib Smoothened X. Other Approved HhAntag691 Erivedge LC Laboratories V-4050 DMSO 10000
(Hh) inhibitor
Deferoxamine Iron chelator X. Other Approved desferrioxamine, Sigma-Aldrich D9533 AQ 10000
(non- DFOM
oncology)
Itraconazole antifungal, hedgehog X. Other Approved Selleck S2476 DMSO 5000
signaling inhibitor (non-
oncology)
NVP-LGK974 PORCN inhibitor X. Other Investigational LGK974 Selleck S7143 DMSO 10000
(Ph 1)
Sonidegib Smoothened (Hh) inhib X. Other Investigational LDE225, erismodegib ChemieTek CT-LDE225 DMSO 10000
(Ph 2) (USAN)
2-methoxyestradiol Angiogenesis inhibitor X. Other Investigational 2ME2 Cayman 13021 DMSO 10000
(Ph 2) Chemical
Company
MK-0752 gamma-secretase/notch X. Other Investigational Selleck S2660 DMSO 1000
inhibitor (Ph 2)
Varespladib Secretory phospholipase X. Other Investigational A-001 ChemieTek CT-VARE DMSO 10000
A2 inhibitor (Ph 2)
1-methyl-D- Indolamine X. Other Investigational 1-methyl-D-tryptophan Sigma-Aldrich 452483 AQ 5000
tryptophan 2,3-dioxygenase 1 and (Ph 2)
2 inhibitor
Glasdegib Smo inhibitor X. Other Investigational Medchemexpress HY-16391 DMSO 1000
(Ph 2)
Tarenflurbil Gamma-secretase X. Other Investigational (R)-Flurbiprofen Cayman 70255 DMSO 10000
inhibitor (Ph 3) Chemical
Company
Tosedostat Aminopeptidase inhibitor X. Other Investigational Tocris 3595 DMSO 10000
(Ph 3) Biosciences
Cilengitide alphaVbeta3 integrin X. Other Investigational NSC 707544 Selleck S7077 DMSO 10000
inhibitor (Ph 3)
Darapladib lipoprotein-associated X. Other Investigational Selleck S7520 DMSO 1000
phospholipase A2 (Ph 3)
inhibitor
Marimastat MMP-9, MMP-1, X. Other Investigational Selleck S7156 DMSO 10000
MMP-2, MMP-14, (Ph 3)
MMP-7 inhibitor
Galiellalactone STAT3-DNA interaction X. Other Probe Santa Cruz sc-202165-6 DMSO 25000
inhibitor Biotechnology
PF-3845 FAAH inhibitor X. Other Probe Selleck S2666 DMSO 10000
15D-PGJ2 Endogenous PPARγ X. Other Probe 15D-PGJ2, 15-deoxy Merck 538927-2 DMSO 3000
ligand, prostaglandin, delta(12,14)
NFkB signaling prostaglandin J2
inhibitor
Stattic STAT3 SH2 domain X. Other Probe Stattic Tocris 2798-03-01 DMSO 50000
TRAM-34 intermediate- X. Other Probe Selleck S1160 DMSO 1000
conductance Ca2+-
activated K+ channel
inh.
deltarasin Ras-PDEdelta inhibitor X. Other Probe deltarasin ChemieTek CT-DELT DMSO 10000
NSC348884 NPM1 oligomerization X. Other Probe Axon Medchem Axon 1402 DMSO 50000
inhibitor
ONX-0914 LMP7 (immuno- X. Other Probe PR-957 Selleck S7172 DMSO 10000
proteasome)
NMS-873 p97/VCP inhibitor X. Other Probe Selleck S7285 DMSO 10000
ML323 USP1-UAF1 inhibitor X. Other Probe Selleck S7529 DMSO 10000
GSK2830371 Wip1 inhibitor X. Other Probe ChemieTek CT-GSK283 DMSO 5000
BCI Dusp6 inhibitor X. Other Probe BCI, NSC 150117 Sigma B4313 DMSO 50000
MST-312 Telomerase inhibitor X. Other Probe Telomerase Inhibitor IX Sigma M3949 DMSO 10000
SH-4-54 STAT3 inhibitor X. Other Probe Selleck S7337 DMSO 25000
Compound name, classification, mechanism of action, clinical phase and supplier information are displayed
9. We provide the R package SynergyFinder to calculate drug
synergy scores using four different reference models,
acknowledging that the optimal method for standardizing
drug combination data analysis remains an open question (28).
We therefore advise users to apply all four models to their
data and to report drug combinations that show a detectable
level of synergy irrespective of the model selected.
10. A strong synergy in a drug combination, as revealed by the
synergy landscape analysis, might not be sufficient to warrant
next-level confirmatory analysis if the synergy does not lead
to a sufficient overall response. We therefore advise always
combining the synergy scores with the raw dose-response
matrix data visualized in Fig. 4, which gives an overview of the
extra benefit of the drug combination compared to the single drugs.
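The advice in Notes 9 and 10 can be illustrated with a minimal Python sketch. This is not the SynergyFinder API; it is an independent illustration that assumes a dose-response matrix of percent inhibition whose first row and first column hold the single-drug responses. Only two reference models that can be computed without curve fitting are shown (Bliss independence and highest single agent, HSA); models that require fitted dose-response curves are omitted. The function name synergy_summary and the cutoffs min_score and min_response are hypothetical choices for illustration, not values recommended in this chapter.

```python
from statistics import mean


def bliss_expected(a, b):
    """Expected % inhibition under Bliss independence (inputs in percent)."""
    return a + b - a * b / 100.0


def hsa_expected(a, b):
    """Expected % inhibition under the highest single agent (HSA) model."""
    return max(a, b)


def synergy_summary(matrix, min_score=5.0, min_response=50.0):
    """Summarize synergy across reference models for one drug pair.

    matrix[i][j] is the % inhibition at dose i of drug 1 and dose j of drug 2.
    Row 0 and column 0 hold the single-drug responses (other drug at zero dose).
    min_score and min_response are illustrative cutoffs (assumptions).
    """
    excess = {"Bliss": [], "HSA": []}
    combo_responses = []
    for i in range(1, len(matrix)):
        for j in range(1, len(matrix[0])):
            a, b, observed = matrix[i][0], matrix[0][j], matrix[i][j]
            # Synergy score per well = observed response minus model expectation
            excess["Bliss"].append(observed - bliss_expected(a, b))
            excess["HSA"].append(observed - hsa_expected(a, b))
            combo_responses.append(observed)
    scores = {model: mean(values) for model, values in excess.items()}
    return {
        "scores": scores,
        # Note 9: require synergy irrespective of the reference model
        "synergy_in_all_models": all(s >= min_score for s in scores.values()),
        # Note 10: synergy alone is not enough; check the raw response too
        "sufficient_response": max(combo_responses) >= min_response,
    }


# Hypothetical 3 x 3 dose-response matrix (% inhibition)
example = [
    [0.0, 10.0, 20.0],
    [10.0, 40.0, 60.0],
    [20.0, 55.0, 80.0],
]
result = synergy_summary(example)
print(result)
```

A combination would be carried forward only when both flags are true: the first reflects agreement across reference models, the second guards against combinations that are synergistic on paper but still give a weak overall response.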
Acknowledgments
This work was supported by the Academy of Finland (grants
272437, 269862, 279163, 295504, and 292611 for TA; 272577
and 277293 for KW); the Integrative Life Science Doctoral
Program at the University of Helsinki (LH); the Sigrid Jusélius
Foundation (KW); and the Cancer Society of Finland (JT, TA, and KW).
This project has received funding from the European Union’s
Horizon 2020 research and innovation program 2014–2020
under Grant Agreement No 634143 (MedBioinformatics).
References
1. Vogelstein B, Papadopoulos N, Velculescu VE et al (2013) Cancer genome landscapes. Science 339:1546–1558
2. Pemovska T, Kontro M, Yadav B et al (2013) Individualized systems medicine strategy to tailor treatments for patients with chemorefractory acute myeloid leukemia. Cancer Discov 3:1416–1429
3. Yang W, Soares J, Greninger P et al (2013) Genomics of drug sensitivity in cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res D41:D955–D961
4. Seashore-Ludlow B, Rees MG, Cheah JH et al (2015) Harnessing connectivity in a large-scale small-molecule sensitivity dataset. Cancer Discov 5:1210–1223
5. Tang J, Aittokallio T (2014) Network pharmacology strategies toward multi-target anticancer therapies: from computational models to experimental design principles. Curr Pharm Des 20:20–36
6. Gillies RJ, Verduzco D, Gatenby RA (2012) Evolutionary dynamics of carcinogenesis and why targeted therapy does not work. Nat Rev Cancer 12:487–493
7. Mathews Griner LA, Guha R, Shinn P et al (2014) High-throughput combinatorial screening identifies drugs that cooperate with ibrutinib to kill activated B-cell-like diffuse large B-cell lymphoma cells. Proc Natl Acad Sci U S A 111:2349–2354
8. Crystal AS, Shaw TA, Sequist VL et al (2014) Patient-derived models of acquired resistance can identify effective drug combinations for cancer. Science 346:1480–1486
9. Pemovska T, Johnson E, Kontro M et al (2015) Axitinib effectively inhibits BCR-ABL1 (T315I) with a distinct binding conformation. Nature 519:102–105
10. Kulesskiy E, Saarela J, Turunen L et al (2016) Precision cancer medicine in the acoustic dispensing era: ex vivo primary cell drug sensitivity testing. J Lab Autom 21:27–36
11. Haltia UM, Andersson N, Yadav B et al (2017) Systematic drug sensitivity testing reveals synergistic growth inhibition by dasatinib or mTOR inhibitors with paclitaxel in ovarian granulosa cell tumor cells. Gynecol Oncol 144:621
12. Saeed K, Rahkama V, Eldfors S et al (2017) Comprehensive drug testing of patient-derived conditionally reprogrammed cells from castration-resistant prostate cancer. Eur Urol 71:319. https://doi.org/10.1016/j.eururo.2016.04.019
13. Berenbaum MC (1989) What is synergy. Pharmacol Rev 41:93–141
14. Loewe S (1953) The problem of synergism and antagonism of combined drugs. Arzneimittelforschung 3:285–290
15. Bliss CI (1939) The toxicity of poisons applied jointly. Ann Appl Biol 26:585–615
16. Chou TC (2006) Theoretical basis, experimental design, and computerized simulation of synergism and antagonism in drug combination studies. Pharmacol Rev 58:621–681
17. Boik JC, Narasimhan B (2010) An R package for assessing drug synergism/antagonism. J Stat Softw 34:6
18. Ritz C, Baty F, Streibig JC (2005) Bioassay analysis using R. J Stat Softw 12:5
19. Greco WR, Bravo G, Parsons JC (1995) The search for synergy: a critical review from a response surface perspective. Pharmacol Rev 47:331–385
20. Zhao W, Sachsenmeier K, Zhang L et al (2014) A new bliss independence model to analyze drug combination data. J Biomol Screen 19:817–821
21. Yadav B, Wennerberg K, Aittokallio T et al (2015) Searching for drug synergy in complex dose-response landscapes using an interaction potency model. Comput Struct Biotechnol J 13:504–513
22. Szwajda A, Gautam P, Karhinen L et al (2015) Systematic mapping of kinase addiction combinations in breast cancer cells by integrating drug sensitivity and selectivity profiles. Chem Biol 22:1144–1155
23. Gautam P, Karhinen L, Szwajda A et al (2016) Identification of selective cytotoxic and synthetic lethal drug responses in triple negative breast cancer cells. Mol Cancer 15:34
24. Karjalainen R, Pemovska T, Majumder M et al (2017) JAK1/2 and BCL2 inhibitors synergize to counteract bone marrow stromal cell-induced protection of AML. Blood 130:789
