Kristen Sen 2014


REVIEWS

Principles and methods of integrative genomic analyses in cancer
Vessela N. Kristensen1–3, Ole Christian Lingjærde2,4, Hege G. Russnes1,2,5, Hans Kristian M. Vollan1,2,6, Arnoldo Frigessi7,8 and Anne-Lise Børresen-Dale1,2

1 Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway.
2 K.G. Jebsen Centre for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, 0313 Oslo, Norway.
3 Department of Clinical Molecular Oncology, Division of Medicine, Akershus University Hospital, 1478 Ahus, Norway.
4 Division for Biomedical Informatics, Department of Computer Science, University of Oslo, 0316 Oslo, Norway.
5 Department of Pathology, Oslo University Hospital, 0450 Oslo, Norway.
6 Department of Oncology, Division of Cancer, Surgery and Transplantation, Oslo University Hospital, 0450 Oslo, Norway.
7 Statistics for Innovation, Norwegian Computing Center, 0314 Oslo, Norway.
8 Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, PO Box 1122 Blindern, 0317 Oslo, Norway.
Correspondence to A.-L.B.-D. e-mail: a.l.borresen-dale@medisin.uio.no
doi:10.1038/nrc3721

Abstract | Combined analyses of molecular data, such as DNA copy-number alteration, mRNA and protein expression, point to biological functions and molecular pathways being deregulated in multiple cancers. Genomic, metabolomic and clinical data from various solid cancers and model systems are emerging and can be used to identify novel patient subgroups for tailored therapy and monitoring. The integrative genomics methodologies that are used to interpret these data require expertise in different disciplines, such as biology, medicine, mathematics, statistics and bioinformatics, and they can seem daunting. The objectives, methods and computational tools of integrative genomics that are available to date are reviewed here, as is their implementation in cancer research.

Leonardo da Vinci’s collection of drawings entitled Studies of the Human Body and Principles of Anatomy started a renaissance in studying human anatomy and pathology that led to a better understanding of the mechanics, proportions and functions of the human body. Today, we live in an era when biological sciences are marked by the same exploratory drive, but this time it is at an invisible, molecular level. The accumulation of enormous quantities of molecular data has led to the emergence of ‘systems biology’, a branch of science that discovers the principles that underlie the basic functional properties of living organisms, starting from interactions between macromolecules1–4. Integrative genomics is based on the fundamental principle that any biological mechanism builds upon multiple molecular phenomena, and only through the understanding of the interplay within and between different layers of genomic structures can one attempt to fully understand phenotypic traits. Therefore, principles of integrative genomics are based on the study of molecular events at different levels and on the attempt to integrate their effects in a functional or causal framework. Although perhaps less aesthetically pleasing than the drawings of Leonardo da Vinci, the new visualization tools based on mathematical models can present the ‘digital universe of information’ in a form that is of use for the treatment of a cancer patient5 and for revealing the existence and principles of molecular interactions that govern fundamental biological mechanisms6.

Cancer is currently one of the most well-characterized pathological systems at the molecular level. Most (if not all) cancers involve genetic aberrations in the germ line and/or at the somatic level. By producing a complete catalogue of inherited and acquired mutations, with functional consequences of each mutation with respect to tumour type, it is hoped that one can, for example, assess the metastatic potential of a tumour and suggest the most promising treatment7,8. Although data are rapidly accumulating from various cancer-profiling projects, interpreting these data is not easy. The development and progression of a tumour is a dynamic biological and evolutionary process. It involves composite organ systems, with genomes shaped by gene aberrations, epigenetic changes, the cellular biological context, characteristics that are specific to the individual patient, and environmental influences9,10. Sophisticated statistical and mathematical techniques have been developed for the analysis, interpretation and validation of biological data, and novel computational techniques and tools are continuously emerging. In principle, mathematical modelling of pattern formation (using methods from interacting particle systems, system dynamics and hierarchical models) can be used to study tumour formation and growth. In practice, statistics and information theory constitute essential methodologies in the analysis of biological data sets. These methodologies are the subject of this Review. We discuss the different computational models

NATURE REVIEWS | CANCER VOLUME 14 | MAY 2014 | 299

© 2014 Macmillan Publishers Limited. All rights reserved



that are being used to assess these data and how these models have been applied to better understand cancer development, progression and treatment response.

Key points
• Genomic, metabolomic and clinical data on a range of solid cancers and model systems are emerging and can be used to identify novel patient subgroups for tailored therapy and monitoring.
• Molecular markers identified at the DNA, mRNA, microRNA and protein levels have been used to develop profiles associated with taxonomy, tumour aggressiveness, response to therapy and patient outcome.
• The information content is higher in integrated analysis than in any of the molecular levels studied separately, and a large number of statistical methods for the integration of ‘omics’ data have emerged.
• The access to large data sets that have been made available by the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) has made it possible to compare the performance of some of the statistical methods of omic data integration on the same data set.
• These recent developments will fundamentally alter the way that we statistically model and evaluate treatment strategies, from identifying patient groups that respond to treatment above random, to identifying pathways and biological entities that are druggable and altered above random.
• A shift from large randomized clinical trials towards treatment modalities that are tailored for stratified patient groups, down to N-of-1 trials, in which a single patient constitutes the entire trial, will require new statistical methods.
• Outsourcing data and searching for solutions in open competition will allow new ideas to instantly emerge to ‘embrace the complexity’ that is associated with the exponentially increasing amounts of data and find new ways of shared analysis.

Multi-dimensional molecular data sources
Continuous improvements in the rate, accuracy and resolution of ‘omics’ data and biochemical features that can be observed in a tumour or a patient have set the stage for the integration of many sources of information, including data from epidemiological studies, clinical studies and genomic and metabolomic profiling (see FIG. 1, which uses breast cancer as an example). Many of these data are housed in different databases. For example, more than 1.5 million individual mutations in 25,606 genes in almost 950,000 samples have been described in the Catalogue of Somatic Mutations in Cancer (COSMIC) database11. Compiling these types of databases is a primary goal of several consortia, such as the Cancer Genome Project12, the International Cancer Genome Consortium (ICGC)13 and The Cancer Genome Atlas (TCGA)14. Project Achilles aims to identify genetic vulnerabilities across large numbers of cancer cell lines by systematic loss-of-function studies9, and the ENCyclopedia Of DNA Elements (ENCODE)15 investigates structural and regulatory units in the human genome. Genome-wide association studies (GWAS) have identified numerous loci that are linked to cancer susceptibility, but the mechanism by which variations at these loci influence susceptibility remains unknown. Understanding how and why these variants influence subtype-specific cancer risk contributes to our understanding of cancer aetiology. For example, many recent studies emphasize that the genetic architecture of breast cancer is context specific, and integrated analysis of gene expression and chromatin remodelling in normal and tumour tissues will be required to explain the mechanisms of risk alleles16. In a network-based strategy, linking GWAS hits with transcription factors that are known to function as master regulators, Fletcher et al.17 found that the risk associated with altered fibroblast growth factor receptor 2 (FGFR2) signalling is due to altered activity of the oestrogen receptor-α (ERα)-associated transcriptional network.

Various molecular markers, which have been identified at DNA, mRNA, microRNA (miRNA) and protein levels, have been used to develop profiles that are associated with taxonomy, tumour aggressiveness, response to therapy and patient outcome18,19. In addition, complex biological features at the cellular level, such as histopathological and radiological images, which were traditionally evaluated and scored visually by a trained expert, are now subjected to computational quantification20,21. However, small or no overlaps between predictive profiles from different sources persist because of the low statistical power of these studies and the different clinical strata used in each study, among other differences. Pooling data sets, combining profiles at various levels and analysing the data in a compendium (such as the GeneSapiens database22, the Integrative Multi-Species Prediction (IMP) server23, Search-Based Exploration of Expression Compendium (SEEK), ProfileChaser24,25, Oncomine26, Rembrandt27 and similar tools) can lead to more reliable molecular signatures and thereby more specific diagnosis and treatment of cancer patients. The joint analysis of multiple data domains, each of which reflects various dimensions of a biological function, has the potential to generate explanatory power that cannot be obtained with one data type alone.

In order to access these data and to carry out some of the integrative analyses detailed below, storage and computing platforms such as Bionimbus, Bioconductor28, Cytoscape29,30, IntOGen31, OncoDrive32 and Synapse (see Further information) have been designed to enable scientists to exchange data sets, algorithms and mathematical models of cancer. Recently, the idea of sharing and interactive collaboration towards solving a certain biological problem was taken forward by Sage Bionetworks in the competition-based crowdsourcing Dialogue for Reverse Engineering Assessments and Methods (DREAM) breast cancer prognosis challenge (BCC)33, indicating the need for databases that enable joint analysis and data exchange between researchers.

Information theory
A branch of applied mathematics that quantifies the value of information in data.

Bioconductor
A free, open-source and open-development software project for the analysis of high-throughput genomic data. Based on the statistical programming language R, the project was started in 2001 and now contains more than 750 packages to carry out data handling, visualization and analysis.

Expression quantitative trait loci
(eQTL). Genomic loci that regulate expression levels of mRNAs or proteins.

Bioinformatics tools for integrative analyses
In the context of this Review, integrative statistical analysis refers to the analysis of at least two different types of omics data34. The analysis can be restricted to molecular data (such as in expression quantitative trait loci (eQTL) studies, in which the relation between germ line variation and gene expression is investigated35,36) or it can involve clinical outcomes (for example, survival, stage and treatment response) or intermediate phenotypes and biomarkers. It is useful to distinguish three broad objectives of integrative analysis, which can be addressed by different statistical tools. The first objective is to understand molecular behaviours, mechanisms and relationships between and within the different types of molecular structures, including associations between these and various phenotypes,


[Figure 1 panel labels. Sample sources: blood samples; tumour cells in blood and bone marrow; sentinel node; biopsies from primary tumours and metastases; various tissues (normal, preinvasive, malignant); in vivo MRI and in vivo MRS of the primary tumour; xenografts, GEMMs and cell lines (mouse models); exposure (radiation, diet, hormones, etc.). Single-level classifications: serum glycans, miRNA, ctDNA and interleukins, etc.; image-based classification (mammograms, x-ray and MRI); genotype; histopathology (clinical and pathology-based classification); genomics (copy number, methylation and sequence classifications); transcriptomics (mRNA classification, miRNA and ncRNA classification); proteomics (protein classification using RPPA, IHC, LC–MS/MS, etc.); metabolomics (HR-MAS MR classification). Integrated classifications: genotypic and phenotypic stratification of breast cancer; clinical outcome; response signature; resistance signature; adverse effect signature; personalized treatment. Expected impact: improved quality of life; higher survival rates.]

Figure 1 | The systems biology of breast cancer. Exploring the systems biology of breast cancer and strategies to investigate multi-dimensional interactions by integration of data from various sources at the indicated levels. ctDNA, circulating tumour DNA; GEMMs, genetically engineered mouse models; HR-MAS MR, high-resolution magic angle spinning magnetic resonance; IHC, immunohistochemistry; LC–MS, liquid chromatography–mass spectrometry; miRNA, microRNA; MRI, magnetic resonance imaging; MRS, magnetic resonance spectroscopy; ncRNA, non-coding RNA; RPPA, reverse phase protein array. Mammography image courtesy of M. M. Holmen of Oslo University Hospital, Oslo, Norway.

such as clinical outcomes, pathways, interactions, ‘hot-spot’ DNA mutations and mutations in genes that drive cancer development. The second objective is to understand the taxonomy of diseases, thereby classifying individuals (or samples) into latent classes of disease subtype; and the third objective is to predict an outcome or phenotype (such as survival or efficacy of therapy) for prospective patients. Some statistical methods are specialized to one type of question, and others can be used for several. These statistical methods are classified into broad groups (summarized in TABLE 1). Some of the tools, such as enrichment analysis, were originally designed to reveal features of genes and pathways, whereas others, such as integrative clustering, were designed to reveal features of patient subgroups; however, most of the tools discussed below can be applied to both, including integrative graphical models, which can be used to identify aberrant pathways and patient subgroups. The statistical methods discussed in this Review can be classified as unsupervised or supervised (for example, according to whether one proceeds in an exploratory manner or applies clinical labels to individual cases). Some methods use cross-validation or other model selection approaches to estimate the over-fitting in the training set.

Over-fitting
In statistics, over-fitting occurs when a statistical model describes random noise instead of the underlying relationship.

Sequential analysis: combining several distinct omics levels of evidence. This approach allows the confirmation or refinement of findings based on one data type, with additional analyses of further omics data obtained from the same set of samples. In this case, at least two types of omics data are analysed, for example, copy-number alterations (CNAs) and gene expression level data. To integrate two different levels of omics data from the same set of breast cancer samples, Chin et al.37 identified genes whose expression levels were significantly deregulated by CNAs, as well as genes that are associated with metastasis and reduced survival. Lando et al.38 used CNAs integrated with gene expression and gene ontology to identify genes representing five biological processes associated with poor outcome in cervical cancer after chemotherapy and radiotherapy. Moreover, Beroukhim et al.39 combined data from 3,131 cancer specimens, which represented 26 different histological types of cancer, and identified 158 regions with focal CNAs that were significantly altered across all samples. Interestingly, 122 of these CNAs did not harbour a known cancer gene. Each of these papers used the approach in which an analysis of each data set is made independently of the others and produces a list of interesting entities, which


Table 1 | Tools and algorithms for the detection of activated and altered pathways

Method | Summary | Refs

Sequential analysis
MCD | Identification of subsets of genes that are affected on multiple levels by some condition | 44,132
CNAmet | Identification of genes that show simultaneous methylation, copy number and expression alterations | 45
iPAC | Integration of copy number and gene expression to detect genes and associated pathways or processes that are influenced in trans by copy number | 42
Consensus clustering | Starting from multiple clusterings (each can represent a data type), obtaining a single integrated cluster assignment | 133–136
CHESS | Determining the effect of copy number on gene expression | 137

Latent variable
iCluster | Starting from multiple data types, obtaining a single integrated cluster assignment | 48,49,138
PSDF | Integrating copy number and gene expression data to discover prognostic patient subtypes | 50
IntegrOmics | Identification of relationships between two ‘omics’ data sets | 72

Penalized likelihood
Lasso | Identification of omics features with predictive ability for a given response (such as survival), using all data as covariates or using some data to decide the penalty of others | 52–54
Elastic Net | Identification of omics features with predictive ability for a given response (such as survival), using all data as covariates or using some data to decide the penalty of others | 55
PLRS | Studying relationships between copy number and mRNA expression; detection of copy number-induced sample subgroup-specific effects | 139
Camelot | Outputs a linear regression model that uses genotype and expression to predict phenotype; powered by regularized linear regression | 140
Lol (Lots of Lasso) | Integration of copy number and gene expression to detect in-cis and in-trans regulation of gene expression | 141

Gene set analysis
GeneXPress | Extraction of modules and characterization of gene expression profiles in tumours as a combination of activated and deactivated modules | 56
GSEA | Gene set annotation of differentially expressed genes | 59,142,143
MAPPFinder | Gene ontology term annotation of differentially expressed genes | 64,67,144
SPIA | Pathway annotation of differentially expressed genes | 65,145,146
Pathologist | A consistency score and an activity score is calculated for each pathway | 66
KOBAS | Pathway and disease annotation of gene sets | 147,148
SubpathwayMiner | Pathway annotation of gene sets | 149,150
MGSA | Identification of active gene sets | 82

Pair-wise correlation
WGCNA | Finding modules of highly correlated genes using eigengene network methodology | 73
Oncodrive-CIS | Ranking genes according to the effect of copy number on gene expression | 151

Network-based analysis
jActiveModules | Identification of expression-activated sub-networks | 78,152,153
GiGA | Identification of the gene subgraphs showing the most significant gene expression pattern | 79
PARADIGM | Prediction of the degree to which the activities of a pathway are altered in an individual | 86,87,125
PathExpress | Determining if there is enrichment of genes around each enzyme, on the basis of gene–metabolic relations in KEGG | 154
AMBIENT | Discovery of metabolic sub-networks that are significantly changed by some condition | 155

Bayesian
CONNEXIC | Integration of copy-number variation and gene expression to identify driving cancer mutations and the processes that they influence | 91
COALESCE | Using gene expression and DNA sequence data as inputs, this method produces putative co-regulated modules as outputs | 68
MDI | Identify groups of genes that tend to be allocated to the same components in multiple data sets or molecular levels | 156

Other
RegMOD | Identification of active modules or dysfunctional pathways | 61

AMBIENT, Active Modules for BIpartitE NeTworks; CHESS, CgHExpreSS; CNAmet, Copy Number Alteration and methylation; COALESCE, Combinatorial ALgorithm for Expression and Sequence-based Cluster Extraction; CONNEXIC, COpy Number and EXpression In Cancer; GiGA, Graph-based iterative Group Analysis; GSEA, gene set enrichment analysis; iPAC, in-trans Process Associated and Cis-correlated; KEGG, Kyoto Encyclopedia of Genes and Genomes; KOBAS, KEGG Orthology Based Annotation System; MAPPFinder, MicroArray Pathway Profile Finder; MCD, Multiple Concerted Disruption; MDI, Multiple Dataset Integration; MGSA, Model-based Gene Set Analysis; PARADIGM, PAthway Recognition Algorithm using Data Integration on Genomic Models; PLRS, Piecewise Linear Regression Splines; PSDF, Patient-Specific Data Fusion; RegMOD, Regression MODel with diffusion kernel; SPIA, Signalling Pathway Impact Analysis; WGCNA, weighted gene correlation network analysis.

are then linked to each other. For example, differentially expressed genes in one list are compared with each other and then with different CNAs that have been matched to the closest gene in a second list. Usually, the lists are intersected to find the genes that are confirmed in the analysis of each data type40. Comparing ranks of each gene in each list leads to measures of concurrence. If each entity in each list has a value (for example, a t-test statistic) then these values are combined. Although each list often contains significantly selected entities after multiple testing corrections, it is not obvious how to assign a P value to the intersection. Permutation testing of each individual analysis before intersection could, in principle, be used. One such flow chart-based data integration framework is Anduril, in which the ultimate goal is to elucidate the impact of various omics data on patient survival41. Occasionally, the various analyses are not performed in parallel but as a sequence of filtering steps, each functioning on a single data type. In this approach, the order of the filtering steps matters. In the in-trans Process Associated and Cis-correlated (iPAC) algorithm42, which was designed to detect cancer drivers, whole-genome gene expression measurements are correlated with segmented copy-number data to obtain a list of genes with strong in-cis correlation. Using each of these in turn as a pivot, all other genes in the genome are ranked according to their correlation to the pivot, and enrichment of gene ontology terms for genes at the top of the ranked list is investigated. As a result, iPAC identifies CNAs with a phenotypic effect in the sense that they have an impact on expression in cis, as well as on processes in trans. In another study that used data from sarcomas, Chibon et al.43 derived a gene signature, known as Complexity INdex in SARComas (CINSARC), by combining known genes of importance with genes whose expression correlated to CNAs and were members of over-represented pathways, including those that affect chromosomal instability or histological grade. Interestingly, CINSARC predicted the likelihood of metastasis development (a surrogate for survival) in patients with sarcomas and also in patients with gastrointestinal stromal tumours (GISTs), breast cancer and lymphoma, which points to the ability of integrative approaches to identify universal features of aggressive cancer.

Several methods regard the expression level of any transcript as a function of copy number and DNA methylation. An example is a tool called Multiple Concerted Disruption (MCD), which aims to integrate DNA copy number and methylation to explain variation in mRNA expression data in cis44. The MCD method searches for deviation from the normal at several levels: a differential expression, a change in gene copy number or a change in the degree of DNA methylation44 (hypomethylation or hypermethylation). The procedure involves several sequential steps and can be carried out either per sample or across a set of samples. By sequentially examining more genomic dimensions at the DNA level (that is, copy number, allelic status and DNA methylation) one can explain a higher proportion of the observed changes in gene expression. Notably, this varies to a great degree from sample to sample, which indicates intrinsically distinct mechanisms leading to deregulation44. The MCD method was followed by a similar method45 (Copy Number Alteration and methylation (CNAmet)), which was implemented in open-source R28,46.

T-test statistic
T-tests are used to determine whether the mean of a continuous variable is different in two groups of individuals. The test is based on a quantity called a t-test statistic, which is computed from the data and reflects the signal-to-noise ratio.

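The idea that adding genomic dimensions explains a higher proportion of the variation in gene expression can be illustrated with a simple linear model. The sketch below is an assumption-laden stand-in and not the MCD or CNAmet procedure: it fits ordinary least squares for a single hypothetical gene and compares the R² obtained from copy number alone with the R² obtained from copy number plus methylation. The data and the function name `r_squared` are invented for illustration.

```python
from statistics import mean


def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(p * q for p, q in zip(a, b))


def r_squared(y, predictors):
    """R^2 of a least-squares fit of y on one or two predictors (with intercept)."""
    yc = [v - mean(y) for v in y]                        # centre the response
    xs = [[v - mean(x) for v in x] for x in predictors]  # centre each predictor
    if len(xs) == 1:
        coefs = [dot(xs[0], yc) / dot(xs[0], xs[0])]
    else:
        # Solve the 2x2 normal equations explicitly.
        s11, s22, s12 = dot(xs[0], xs[0]), dot(xs[1], xs[1]), dot(xs[0], xs[1])
        s1y, s2y = dot(xs[0], yc), dot(xs[1], yc)
        det = s11 * s22 - s12 * s12
        coefs = [(s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det]
    fitted = [sum(c * x[i] for c, x in zip(coefs, xs)) for i in range(len(y))]
    ss_res = sum((a - f) ** 2 for a, f in zip(yc, fitted))
    return 1 - ss_res / dot(yc, yc)


# Hypothetical measurements for one gene across six samples.
copy_number = [2, 2, 3, 3, 4, 4]
methylation = [0.1, 0.8, 0.2, 0.9, 0.1, 0.7]
expression = [2.8, 1.4, 3.6, 2.2, 4.8, 3.6]  # built as 1 + copy_number - 2 * methylation

r2_copy_number = r_squared(expression, [copy_number])
r2_both = r_squared(expression, [copy_number, methylation])
print(round(r2_copy_number, 3), round(r2_both, 3))
```

Because the toy expression values were constructed from both predictors, the two-predictor fit explains essentially all of the variance while copy number alone explains only part of it; real data would show a much noisier and sample-dependent picture, as the text notes.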


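Several of the gene set analysis entries in TABLE 1 annotate a list of differentially expressed genes with predefined gene sets. One common way to score such over-representation is an upper-tail hypergeometric test; the sketch below assumes that choice for illustration only, since the individual tools listed in the table use their own scoring schemes. The function name and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric P value, P(X >= n_overlap).

    n_universe: number of genes measured
    n_in_set:   number of measured genes belonging to the gene set
    n_selected: number of genes in the selected (for example, differentially expressed) list
    n_overlap:  number of selected genes that belong to the gene set
    """
    max_overlap = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 20 measured genes, a pathway with 5 members,
# 4 selected genes of which 3 fall in the pathway.
p_enriched = enrichment_p_value(20, 5, 4, 3)
p_no_overlap = enrichment_p_value(20, 5, 4, 0)
print(p_enriched, p_no_overlap)
```

In practice such P values would be computed for many gene sets and then corrected for multiple testing.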

Latent variable analysis: using common factor labels derived from multiple omics levels. Unsupervised clustering of omics data can be used to partition individuals or samples into subgroups of potential clinical relevance47. In the iCluster package48, for example, the clustering of individual samples is carried out by applying metrics (or noise structures) that are specific to each data type but using common latent labels among all data types, employing an expectation-maximization algorithm (EM algorithm). This method can be extended to supervised clustering when the data are continuous, such as for expression data or CNAs49, and it can accommodate any number of data types. The number of clusters is difficult to determine and is estimated by cross-validation methods. Using iCluster, Curtis et al.49 found that genome variation influenced gene expression, identified putative cancer genes and defined novel subgroups of patients with breast cancer who had distinct outcomes48. Furthermore, trans-acting aberrant DNA hot-spots that modulated subtype-specific gene networks were shown. A further development has been suggested by Yuan et al.50. Their Patient-Specific Data Fusion (PSDF) algorithm exploits the fact that the data to be integrated in individual samples might seem to be contradictory within the data pool; for example, a high copy number of a gene could be associated with a high expression of the same gene in cis in most, but not all, samples. Such a contradiction can be seen as a measurement error or biological variation due to the cell composition of a biopsy or patient characteristics. PSDF estimates a latent variable per patient, which helps to exclude (or minimize) contradictory samples. This idea could potentially be used beyond clustering, for other tasks of integrative analysis.

Expectation-maximization algorithm
(EM algorithm). An iterative algorithm for the estimation of parameters in statistical models depending on unobserved variables. A limitation with EM is that it requires specification of initial values for the iteration, and the estimated parameters may depend on these.

Penalized likelihood analysis: using regularization to handle high-dimensional multi-omics data. The aim of integrative regression is to determine the genes (or entities), using at least two different omics data types, that allow the best prediction of the outcome. Since the number of covariates mostly exceeds the number of samples, some form of variable selection or penalized regression is necessary51. When sparsity can be assumed (that is, when only a few entities are expected to actually be relevant for the outcome), Lasso52,53 is a very useful penalization method, as it carries out variable selection. Cross-validation is used to determine an optimal level of penalization, which influences the sparsity of the solution. A straightforward way to use Lasso with two different data types is to use all data as covariates54 (after appropriate standardization): in this case, the algorithm chooses the optimal set of predictors from either omics source. Adaptive Lasso works in two steps and, like Elastic Net55, is more parsimonious than Lasso. A different analysis is known as Weighted Lasso, in which the Lasso uses only one covariate type (such as the mRNA expression level) while the other covariate modifies the penalization so that genes are individually penalized. For example, the penalization of a gene expression can depend on the correlation between the CNA and the outcome, so genes with an important CNA will be penalized less in the expression analysis38. Regression-based integration is mostly in cis, but it can easily be extended to more data types.

Lasso
A shrinkage and variable selection method for linear regression, used in particular when there are many covariates (for example, genes).

Gene set analysis: discovering novel or using known groups of related molecules. One of the earliest reported examples of an integrative approach for gene expression data was the use of GeneXPress to identify modules of genes that affect the activity of a tumour56. Segal et al.56 analysed data from 22 cancer types and found that distinct shared modules of gene activity, which probably represented common tumour progression mechanisms, characterized distinct tumour types. A different strategy involves initially defining a collection of gene sets (for example, gene ontology terms or pathways). This step typically involves the use of publicly available databases that collect extensive annotation and knowledge (for example, Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome and WikiPathways57; see Further information). A score is calculated for each gene58 (for example, a P value that reflects the degree of differential expression), and all gene sets that are ‘enriched’ or over-represented with high or low scores are identified. The scores can also be binary (0 or 1), thereby indicating, for example, membership in a group of differentially expressed genes. By combining gene ontology, gene expression and clinical data, Subramanian et al.59 used gene set enrichment analysis (GSEA) to identify genes consistently associated with poor outcome in two independent cohorts of patients with lung cancer. Information based on known protein–protein interactions has been used to identify gene modules expressed in non-malignant bystander cells60, associated with metastatic disease61 or associated with aggressive disease in lymphoma60–63,157. Several alternative ways of scoring the abnormal presence of specific pathways have also emerged, including Gene Microarray Pathway Profiler (also known as MicroArray Pathway Profile Finder (MAPPFinder))64. These methods describe the functional profile of a list of genes or proteins by comparing it with known (a priori) interactions and scoring the over-representation of a given pathway, ignoring any knowledge about the network structure. Other tools, such as Signalling Pathway Impact Analysis (SPIA)65 and Pathologist66, exploit pathway topology by taking into account the position of a gene in a pathway. SPIA uses the number of neighbours for every gene (the ‘degree’), so that a gene with a higher degree is more likely to have a master role than a more isolated gene and is then favoured in the analysis of the original data67. Pathologist assumes that every gene is either active or inactive in a network, and this method models this as a mixture of two gamma distributions, using the EM algorithm to compute both a gene activity score and an overall pathway score68,69.

Pairwise correlation analysis: inferring molecular network interactions from strengths of associations. In this type of analysis, for each pair of co-measured omics data, a correlation matrix is estimated70, with P values that are corrected for multiple testing and that therefore reflect the strength of association. This approach includes associations in trans. The structure in the matrix can be used

304 | MAY 2014 | VOLUME 14 www.nature.com/reviews/cancer

© 2014 Macmillan Publishers Limited. All rights reserved


to identify master regulators71. Correlation analysis does not directly facilitate the study
of how entities (such as expression levels and CNAs) regulate outcomes of interest, but
highly correlated entries can be used in further studies, such as in canonical correlation
analysis72. There are multiple ways to extend the correlation analysis to more than two data
types. For example, weighted gene co-expression network analysis (WGCNA; also referred to as
weighted gene correlation network analysis) describes the correlation patterns among genes
across microarray samples and is a method for finding clusters (or modules) of highly
correlated genes using matrix calculus73. Correlation networks facilitate network-based gene
screening methods that can be used to identify candidate biomarkers or therapeutic targets.
In order to identify higher order interactions, for example those in which highly
cooperative processes involve many subunits of a protein, dependence among multiple
variables can be established using maximum entropy techniques74 or information theory
approaches. These methods are distinct from correlation methods and in some ways might be
more powerful75.

Network analysis: using molecular network interactions to identify active or aberrant
subgraphs. Networks are a representation of how genes or other entities collaborate in
certain biological systems76,77. A graph ‘sums up’ these effects over time, and two genes
will be linked by an edge if they seem to interact in a specific process. Graphical
algorithms that capture the interaction between differentially expressed genes by
correlation include jActiveModules78 and Graph-based iterative Group Analysis79 (GiGA).
jActiveModules integrates knowledge from protein–protein and protein–DNA interaction
databases into mRNA expression data by assigning a Z-score for differentially expressed
genes, and it searches for connected sub-networks by simulated annealing and greedy search
algorithms60,80. Both simulated annealing and greedy search identify differentially
expressed sub-networks: the former in a near-optimal but computationally very intensive way,
the latter more rapidly but less accurately. GiGA also ranks genes on the basis of
differential expression levels and searches for sub-networks. In 2011, Stingo et al.81
proposed a Bayesian approach that selects both the actual pathways (out of a large set of
possible ones) and the key genes that allow the best prediction of an outcome. Additional
Bayesian methods to infer the functional content of a list are presented in REFS 82,83.

Statistical graphical models with feedback have been successful in summarizing data and
representing pathways. The study of such networks can lead to an important understanding of
biological mechanisms84,85. A few, such as PAthway Recognition Algorithm using Data
Integration on Genomic Models (PARADIGM), have been extended to integrative analysis86.
Activity levels of each gene are considered as latent variables, which are estimated and
then used in subsequent analyses. Given a set of genes of interest, the first step is to
assemble from public databases (see Further information for examples) a large enough network
of genes, with their activating and inhibitory interactions, and to transform this into a
directed graph, for which the key biological assumptions are followed: that is, for each
gene, CNA affects expression, which affects protein levels, which affect the latent protein
activity. These activity nodes are then connected to other gene-specific nodes in ways that
are predicted from existing knowledge. This graph represents the ‘normal’ or reference
state. When data are attached to some of the nodes (for example, all expression levels and
CNAs for a sample of individuals with a specific disease), a joint posterior distribution is
then computed for all latent activity nodes, and this approach is called integrated pathway
activities (IPAs). By comparing pre- and post-activity levels, it is possible to obtain a
quantitative description of the alteration that is induced by the disease, with respect to
normality (or between different groups). In order to make the computations feasible, all
measurements are categorized in three discrete states (inhibited, normal and activated).
Despite this, the computational burden is so large that the EM algorithm is locally applied
in each node, and this leads to an approximation that reduces computational time. PARADIGM
was tested using copy number and mRNA expression data86, as well as with the addition of
methylation and miRNA expression data87. The methodology is, in principle, open to
incorporate further levels of complexity, such as different progression levels (from normal
tissue to pre-invasive and invasive cancer), which add a further level to the analysis. An
example from breast cancer that combined different progression levels with multiple levels
of molecular data, clinical data and pathway information used the properties of PARADIGM to
define groups of patients with distinct biological signatures and different prognoses87.

Bayesian analysis: imposing realistic assumptions to coherently integrate multiple omics
data. Bayesian methods naturally facilitate the integration of biological knowledge through
the design of appropriate prior distributions. In a Bayesian multiple testing setup, one can
use a second type of omic data (for example, CNAs) to modulate the a priori probability that
each test for a first data set (for example, expression levels) is rejected88. Bayesian
networks are not new, and they were used in the early 2000s to incorporate various data89.
As is true for every statistical method, Bayesian analysis is based on assumptions (both
probabilities and prior assumptions) and models based on these have to be realistic and well
designed so that they can be trusted. Usually, in a Bayesian setting, one carries out
sensitivity analysis to assess informative prior assumptions, but, in most cases, prior
assumptions are non-informative. Nevertheless, Bayesian approaches have a natural and
important role in data integration, and they differ by the fact that prior distributions can
represent knowledge, and conditional independence facilitates the integration of data in a
coherent way. Bayesian variable selection has been successfully applied to situations that
comprise one data type90, but it could be extended to multiple data types using similar
fundamental biological assumptions as in Huttenhower et al.69. Other computational
frameworks use integrative Bayesian approaches to identify

Maximum entropy techniques
An alternative to maximum likelihood, maximum entropy techniques are a way to estimate
models from data, by finding the most random probability distribution that fits the data.

Simulated annealing
A global optimization algorithm that seeks a good approximation to the point of absolute
maximum of a function.

Greedy search algorithms
In optimization, a greedy algorithm is an iterative algorithm that takes an optimal (or
semi-optimal) choice at every step, in the hope of obtaining the global solution at
convergence. These algorithms do not generally result in optimal solutions and are used when
the determination of a global solution would require an unacceptable amount of computing
time.

Bayesian approach
An approach to statistics that involves starting from our current (a priori) level of
knowledge, collecting data and then using both to infer our (a posteriori) knowledge.
Bayesian inference allows the incorporation of additional external knowledge into the
estimation process.

Latent variables
In statistics, latent variables (as opposed to observable data) are not measured but must be
estimated from data, similar to parameters. However, contrary to parameters, latent
variables are random and have a distribution. Latent models are inherently Bayesian.
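To make the glossary entries on greedy search and simulated annealing more concrete, the following is a minimal sketch of both strategies applied to the kind of task the text attributes to jActiveModules: finding a connected sub-network with a high aggregate Z-score. Everything in it is an assumption made for illustration (the toy network, the gene names, the Z-scores, the scoring function sum(z)/sqrt(k) and all function names); it is not the jActiveModules implementation.

```python
import math
import random

# Toy undirected interaction network with hypothetical gene names.
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("B", "F"), ("F", "G")]
# Hypothetical per-gene Z-scores for differential expression.
Z = {"A": 0.2, "B": 2.5, "C": 1.8, "D": -0.5, "E": 3.0, "F": 2.1, "G": -1.0}

NEIGHBOURS = {gene: set() for gene in Z}
for u, v in EDGES:
    NEIGHBOURS[u].add(v)
    NEIGHBOURS[v].add(u)


def score(nodes):
    """Aggregate Z-score of a node set (assumed scoring rule): sum(z) / sqrt(k)."""
    if not nodes:
        return float("-inf")
    return sum(Z[g] for g in nodes) / math.sqrt(len(nodes))


def is_connected(nodes):
    """Depth-first search check that the node set induces a connected sub-network."""
    nodes = set(nodes)
    if not nodes:
        return False
    stack, seen = [next(iter(nodes))], set()
    while stack:
        gene = stack.pop()
        if gene in seen:
            continue
        seen.add(gene)
        stack.extend((NEIGHBOURS[gene] & nodes) - seen)
    return seen == nodes


def greedy_search(seed):
    """Grow from a seed gene, adding the best-scoring neighbour until no addition helps."""
    current = {seed}
    while True:
        frontier = set().union(*(NEIGHBOURS[g] for g in current)) - current
        best = max(frontier, key=lambda g: score(current | {g}), default=None)
        if best is None or score(current | {best}) <= score(current):
            return current
        current.add(best)


def simulated_annealing(start, steps=2000, t_start=1.0, t_end=0.01, rng=None):
    """Toggle random genes in or out of the set; worse connected states are
    accepted with a probability that shrinks as the temperature is lowered."""
    rng = rng or random.Random(0)
    current = set(start)
    best = set(current)
    genes = sorted(Z)
    for i in range(steps):
        temperature = t_start * (t_end / t_start) ** (i / (steps - 1))
        candidate = current ^ {rng.choice(genes)}
        if not is_connected(candidate):
            continue
        delta = score(candidate) - score(current)
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = candidate
            if score(current) > score(best):
                best = set(current)
    return best


greedy_result = greedy_search("B")
annealed_result = simulated_annealing(greedy_result)
print(sorted(greedy_result), round(score(greedy_result), 3))
print(sorted(annealed_result), round(score(annealed_result), 3))
```

On this toy network the greedy search started at gene B stops at the sub-network {B, C, F}, because every single-gene extension lowers the score, although the larger connected set {B, C, D, E, F} scores higher. Simulated annealing, started here from the greedy solution, can escape such a local optimum by occasionally accepting a worse move, at the cost of many more score evaluations.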

candidate drivers from copy number and expression data (for example, COpy Number and
EXpression In Cancer (CONNEXIC)91), to cluster samples while simultaneously estimating the
number of clusters50, or to perform regulatory module predictions from co-expressed
biclusters (Combinatorial ALgorithm for Expression and Sequence-based Cluster Extraction
(COALESCE)68).

There are additional methods that do not naturally fit into the defined classifications
above. Models that are based on ordinary differential equations can integrate various types
of data but, as they model complex chemical reactions at a molecular level, they require a
large number of input parameters that are usually unknown92. In some cases, support vector
machines have been used to evaluate the active score of each gene and to identify nonlinear
dependencies in active networks in a computationally efficient way (for example, Regression
MODel with diffusion kernel (RegMOD)61).

Support vector machines
In machine learning, support vector machines are supervised learning models that are used
for classification and regression analysis.

The novel statistical and computational approaches described above are bringing us to a
level at which we can analyse molecular data at all studied molecular levels (FIG. 1), in an
integrated manner. For instance, by using the matrix of IPAs generated by PARADIGM, from the
summarization of copy number, expression level and known interactions among the genes, one
could identify a better prognostic signature than the one derived from expression clusters
alone87 (FIGS 2–5). By layering on CNAs and mutation data it has become possible to deduce
how an individual tumour evolved93,94. Furthermore, Chari et al.44 showed that by examining
samples using more genomic dimensions, including copy number, allelic status and DNA
methylation, they were able to explain a higher proportion of the variation in gene
expression compared with studying each genomic level separately. Remarkably, the proportion
of variation in gene expression widely varied from patient to patient, which indicates
different regulatory mechanisms and complex individual gene–gene interactions in trans that
are specific for every tumour. This inter-individual variation might be a limiting factor in
the identification of molecular markers that are associated with tumour aggressiveness,
response to therapy and patient outcome.

Integrative analyses across tumour types
Over the past decade, the accumulation of high-throughput molecular data from various cancer
types has revealed an enormous range of alterations. Although subgroups of tumours with
similarities in biological properties or clinical behaviour can be defined, the initial
studies mainly analysed one type of molecular data at a time. The access to large data sets
that have been made available by the ICGC and TCGA has made it possible to compare the
performance of some of the tools described above, on the same data set, as well as to
compare the identified deregulated pathways between different cancer types. A pilot project
from TCGA integrated DNA copy number, gene expression and DNA methylation, as well as
nucleotide sequence aberrations from glioblastoma samples95. Enrichment analysis revealed
new roles for known cancer genes, as well as network activity. Later, the same data set was
interrogated by Anduril41 and by PARADIGM86. Both approaches suggested that amplification of
the epidermal growth factor receptor (EGFR) was important in glioblastoma. Anduril, which
can make use of DNA methylation data, also indicated DNA hypomethylation as a significant
change that was evident in glioblastoma.

The first pan-cancer analyses aim to identify drivers of tumorigenesis that are common to
multiple tumour types96,97. For example, the aim of TCGA is to generate genomic data at all
molecular layers in 10,000 tumours from 20 tumour types and to make these data available for
the community98. A recent endeavour to integrate somatic mutations, CNAs and DNA methylation
was carried out in 3,299 tumours of 12 different cancer types96. After integration with mRNA
expression, a total of 479 candidate functional alterations were predicted, including 116
copy-number gains, 151 copy-number losses, 199 recurrently mutated genes and 13
epigenetically silenced genes. A hierarchical stratification was built using principles from
network modularity99. Interestingly, on the basis of these analyses, tumours seemed to be
driven either by somatic mutations or by CNAs, a phenomenon that the authors named ‘the
cancer genome hyperbola’, owing to the inverse relationship between these events. However,
some genes, such as TP53 and phosphatidylinositol-4,5-bisphosphate 3-kinase, catalytic
subunit-α (PIK3CA), can be subjected to both aberration modes, thereby leading to the
deregulation of common pathways such as p53-mediated apoptosis, PI3K–AKT signalling and cell
cycle control.

Studying the relationship between the different genomic levels (FIGS 2–5) opens a debate
over their explanatory weight and potential to discover drivers of cancer100. Ovaska et
al.41 found unexpectedly poor concordance between gene amplification, overexpression of the
genes from the amplicons and survival in patients with glioblastoma. Akavia et al.91 showed
that the expression of a driver (not its copy number per se) drives a phenotype. The authors
draw our attention to the fact that many of the current studies attempt to identify drivers
only in genomic loci for which there is a good correlation between copy number and mRNA
expression. So far, many approaches have been based on linear correlation analysis. On the
basis of knowledge of enzyme kinetics and gene regulation, we expect nonlinear dependencies
to occur in addition to linear effects. We recently proposed a statistical approach to
investigate linear and nonlinear dependencies between CNA and mRNA expression101.

Clinical application. The discussion above has addressed the problem of inferring biological
networks of relevance for translation into the clinic, based on a simple map of genes,
transcripts and proteins. A paradigm shift is needed, from searching for single strong
clinical markers to searching for a combined effect of multiple markers, as, in general,
genes

[Figure 2 graphic. Part a: Unsupervised clustering at each level (bar charts of cluster
sizes for copy number, mRNA expression, miRNA expression and DNA methylation). Part b:
Survival within clusters (Kaplan–Meier curves; x axis, Time (months), 0–120; y axis,
Proportion, 0.0–1.0). Part c: Comparison of clusters on different levels.]

Cluster sizes given in the figure legends:
Copy number: CNA 1 (n=26), CNA 2 (n=13), CNA 3 (n=6), CNA 4 (n=27), CNA 5 (n=13), CNA 6 (n=25)
mRNA expression: Exp. 1 (n=20), Exp. 2 (n=8), Exp. 3 (n=10), Exp. 4 (n=16), Exp. 5 (n=11),
Exp. 6 (n=26), Exp. 7 (n=20)
miRNA expression: miRNA 1 (n=24), miRNA 2 (n=43), miRNA 3 (n=32)
DNA methylation: Meth. 1 (n=43), Meth. 2 (n=9), Meth. 3 (n=4), Meth. 4 (n=33)

Part c, cluster assignment of each sample at each molecular level:

Sample ID   CNA   Meth   miRNA   mRNA
MICMA263    4     1      2       1
MICMA220    4     1      2       1
MICMA112    4     1      2       1
MICMA139    4     1      2       1
MICMA098    4     1      2       1
MICMA085    4     1      2       7
MICMA144    4     1      2       7
MICMA150    4     1      2       3
MICMA122    4     1      2       3
MICMA223    4     1      2       2
MICMA300    4     1      3       4
MICMA023    4     1      3       4
MICMA034    4     1      3       1
MICMA014    4     1      1       1
MICMA017    4     1      1       1
MICMA083    4     1      1       4
MICMA232    4     4      2       6
MICMA019    4     4      1       6
MICMA068    4     3      3       2
MICMA267    1     4      2       7
MICMA091    1     4      2       7
MICMA283    1     4      2       7
MICMA298    1     4      2       6
MICMA119    1     4      3       5
MICMA080    1     4      3       5
MICMA086    1     4      3       4
MICMA088    1     4      1       7
MICMA020    1     4      1       5
MICMA015    1     4      1       2
MICMA146    1     1      1       7
MICMA069    1     1      1       2
MICMA201    1     1      1       1
MICMA024    1     1      3       1
MICMA042    1     2      2       6
MICMA064    1     2      1       6
MICMA246    1     3      2       6
MICMA106    5     1      2       3
MICMA275    5     1      2       3
MICMA309    5     1      2       2
MICMA003    5     1      2       1
MICMA089    5     1      3       4
MICMA338    5     1      3       4
MICMA101    5     1      3       1
MICMA209    5     4      3       6
MICMA264    5     4      3       3
MICMA065    5     2      2       3
MICMA222    2     4      3       7
MICMA044    2     4      3       5
MICMA371    2     4      1       6
MICMA318    2     2      2       7
MICMA057    2     2      2       5
MICMA067    2     3      2       6
MICMA022    2     1      2       1
MICMA053    3     1      2       6
MICMA221    3     1      2       6
MICMA308    3     1      2       7
MICMA632    3     1      2       1
MICMA079    3     4      3       7
MICMA018    3     3      2       6
MICMA245    6     1      2       2
MICMA355    6     1      1       4

Figure 2 | Classifying breast cancer using unsupervised clustering. The first solid tumour
to be profiled by expression arrays was carcinoma of the breast119. The most reproducible
classification by mRNA expression is based on the biological entities referred to as the
intrinsic subtypes: luminal A, luminal B, basal-like, human epidermal growth factor
receptor 2 (HER2)-enriched and the normal-like groups120,121. In the past decade, several
molecular studies to classify breast cancer have added one or two molecular levels (most
frequently, DNA copy number42,49,122,123 and gene sequencing124). However, few of the
studies have integrated more than two levels of information from the same patients87,125. In
our laboratory, we have collected several layers of high-throughput molecular data from
patients with breast cancer, including DNA methylation, DNA copy number alterations, mRNA
expression and microRNA (miRNA) expression93,126–131. Clustering according to each molecular
level reveals a variable number of clusters (part a). Kaplan–Meier plots are shown for each
patient cluster within each molecular level (part b). Comparison of clusters on different
molecular levels reveals that some breast cancer samples cluster together at all the
molecular levels, while others cluster in different groups according to the particular
molecular endpoint (part c). Figure parts a,b are reproduced, with permission, from REF. 87.
Exp, expression; Meth, methylation.
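The comparison in part c of Figure 2 can also be summarized numerically. As a hedged illustration, the sketch below cross-tabulates the cluster labels of two molecular levels and computes the adjusted Rand index (ARI), a standard chance-corrected agreement measure. The ten samples are rows copied from part c of the figure; the choice of the ARI and the function names are assumptions of this sketch and are not taken from the article.

```python
from collections import Counter
from math import comb

# (sample ID, CNA, Meth, miRNA, mRNA) cluster labels, copied from ten rows of Figure 2, part c.
ROWS = [
    ("MICMA263", 4, 1, 2, 1),
    ("MICMA220", 4, 1, 2, 1),
    ("MICMA085", 4, 1, 2, 7),
    ("MICMA232", 4, 4, 2, 6),
    ("MICMA267", 1, 4, 2, 7),
    ("MICMA091", 1, 4, 2, 7),
    ("MICMA146", 1, 1, 1, 7),
    ("MICMA106", 5, 1, 2, 3),
    ("MICMA209", 5, 4, 3, 6),
    ("MICMA222", 2, 4, 3, 7),
]
LEVELS = {"CNA": 1, "Meth": 2, "miRNA": 3, "mRNA": 4}


def labels(level):
    """Cluster labels of all samples at one molecular level."""
    return [row[LEVELS[level]] for row in ROWS]


def contingency(labels_a, labels_b):
    """Count samples for every pair (cluster at level A, cluster at level B)."""
    return Counter(zip(labels_a, labels_b))


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions (1.0 means identical)."""
    n = len(labels_a)
    pairs_both = sum(comb(c, 2) for c in contingency(labels_a, labels_b).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:  # both partitions trivial; treat as perfect agreement
        return 1.0
    return (pairs_both - expected) / (maximum - expected)


print(contingency(labels("CNA"), labels("Meth")))
print(adjusted_rand_index(labels("CNA"), labels("Meth")))
```

An ARI near 1 indicates that two levels group the samples in almost the same way, whereas values near 0 indicate agreement no better than chance. For this ten-sample excerpt the CNA and methylation partitions agree about as much as expected by chance, which matches the caption's remark that some samples cluster differently depending on the molecular endpoint.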

and proteins function by interacting with DNA, RNA and proteins, and these interactions
might be specific for a given disease subclass102. Many of the current targeted therapies
focus on proteins that are involved in cell signalling pathways, which form a complex
cellular communication system that governs basic cellular functions103,104. Established
examples of targeted cancer treatment include EGFR-mutated non-small-cell lung cancer that
can be treated with tyrosine kinase inhibitors (gefitinib or erlotinib)105,106, ERBB2 (also
known as HER2)-directed therapy in breast cancer107,108, and melanomas with BRAF V600E
mutations that can be targeted with vemurafenib109.

A major challenge in drug development is to precisely define the subset of cancer patients
that are likely to respond. Within each pathway, a range of drugs may be available, and the
optimal target (and, hence, the optimal drug) will be determined by the rate-limiting
protein and the individual perturbations in the pathway. In colorectal cancer, EGFR-directed
therapy with monoclonal antibodies has proven to be effective110. However, in the presence
of a downstream activating KRAS mutation, the inhibition of EGFR is ineffective111. It seems
likely that similar mechanisms are present in cases with resistance to other cancer
treatments (both targeted and more traditional chemotherapeutic agents). Iadevaia et al.112
have proposed a computational procedure to generate experimentally testable intervention
strategies for the optimal use of available drugs in a cocktail. They used reverse phase
protein array to evaluate the changes in the phosphorylation status of proteins after
stimulation of the MDA-MB-231 breast cancer cell line with insulin-like growth factor, and
they were able to conclude that the simultaneous inhibition of MAPK and PI3K–AKT pathways
was sufficient to significantly halt cell proliferation112. Future methods will require
adding methylation and expression data to such integrative approaches. Introducing
systematic clinical screenings for mutations that perturb these pathways is of great
importance to identify the targets for targeted therapies and the patients that will respond
to each treatment.

Outcome prediction that is based on genomic data is another central area of genomic
research, and it has proven to be promising in breast cancer. One of the crucial issues in
retrospective studies is that treatment selection is mostly based on the predicted risk of
recurrence. Thus, treatment might be confounded by prognosis. This challenges the
identification of pure prognostic markers, as the treatment interaction is not known. Even
though the results from prospective validation trials, such as the Microarray In
Node-negative and 1–3 positive lymph node Disease may Avoid Chemotherapy (MINDACT) trial and
the Trial assigning individualized options for treatment (TailorX), are still pending,
prediction tools based on gene expression are included in some clinical guidelines113,114.
Optimal strategies for risk prediction are, however, not settled and remain controversial.
Crowdsourcing strategies for problem solving, which were previously successfully applied to
biology in areas such as the prediction of protein folding and function115,116, have been
applied to this problem. In the DREAM BCC competition33, participants competed to create an
algorithm that could predict, more accurately than current benchmarks, the prognosis of
patients with breast cancer from clinical information (age, tumour size and histological
grade), genome-scale tumour mRNA expression data and DNA copy-number data from 1,980
patients33. Integration of data was encouraged, and more than 1,400 models were submitted.
The winners used a mathematical approach that was based on co-expression gene networks
associated with tumour phenotype and

[Figure 3 graphic. Part a: Class distribution of cluster (bar chart for PDGM 1 to PDGM 5).
Part b: Kaplan–Meier curve (x axis, Time (months), 0–120; y axis, Proportion, 0.0–1.0; one
curve each for PDGM 1 to PDGM 5). Part c: Heat map of IPL, with rows labelled IL4
signalling, Thromboxane A2 signalling, IL23 signalling, IL12 signalling, TCR signalling,
NFAT–calcineurin transcription, FOXM1 transcription, ERBB4, Endothelins and Angiopoietin
receptor TIE2-mediated signalling; colour scale labelled Gene IPL, from 2.00 to –2.00.]

Figure 3 | Classifying breast cancer using PARADIGM. All multiple layers of high-throughput
molecular data described in FIG. 2, including DNA methylation, DNA copy number alterations,
mRNA expression, microRNA (miRNA) expression as well as TP53-mutation status, were subjected
to integrated analysis using the PAthway Recognition Algorithm using Data Integration on
Genomic Models (PARADIGM). This resulted in five clusters (part a) with survival differences
(part b) and this was validated in multiple other datasets87. A heat map of integrated
pathway levels (IPLs) is shown in part c. FOXM1, forkhead box M1; IL, interleukin; PDGM,
PARADIGM cluster; TCR, T cell receptor; TIE2, tyrosine kinase, endothelial. Figure is
reproduced, with permission, from REF. 87.
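Part b of Figures 2 and 3 compares survival between clusters using Kaplan–Meier curves. As a minimal sketch of how such a curve is estimated for one cluster, the function below applies the product-limit formula S(t) = product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of events at t_i and n_i the number of patients still at risk. The follow-up times and event indicators are invented for illustration and do not come from the article.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time of each patient (for example, in months)
    events : True if the event was observed, False if the patient was censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at time t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up data for six patients in one cluster.
months = [5, 8, 12, 12, 20, 30]
observed = [True, False, True, True, False, True]
for t, s in kaplan_meier(months, observed):
    print(t, round(s, 3))
```

Censored patients (here at 8 and 20 months) do not lower the curve themselves, but they reduce the number at risk for later events. Applying the function separately to the patients of each cluster gives one curve per cluster, as in the figure panels.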

functional characteristics to identify signature ‘attractor’ meta-genes, and this approach
outperformed other models to predict outcome117,118. These examples support the notion that
using the expertise of participants outside of traditional biological disciplines could be a
powerful way to accelerate the translation of biomedical science into the clinic.

Limitations of integrative analyses
Integrative analyses are likely to become ever more important as computational strategies
and tools are further improved and multilevel omics data sets become more abundant. The
quest to understand the interplay within and between different molecular levels in cancer is
no longer beyond our reach. It is important, however, to be aware of the limitations of the
current methodologies. From a statistical perspective, the most fundamental challenge in
integrative analyses is dimensionality: taking more levels into account in the analysis
tends to increase the dimensionality of the problem. Adding more layers of data or
increasing the resolution of measurements increases the dimension of unknown parameters,
which are often difficult to estimate, thereby making the overall inference weaker. This
might seem paradoxical, as the purpose of taking multiple levels into account is precisely
the opposite: to use more observations to obtain a more accurate picture of the biological
system under study. The way out of this apparent paradox is to realize that, first, one is
able to infer more properties of a system with integrative approaches and, second,
statistically efficient integrative methodologies can be constructed by actively using known
properties of the relationships between the molecular levels. The second point ensures that
additional variables in the analysis are not, in effect, increasing the degrees of freedom
of the underlying model but rather lending information to existing variables. In addition,
at every step, there will be checkpoints of compatibility of the data, such as normalization
to the same scale, sample selection from representative cohorts, adequate correction for
technical batch effects and use of different platforms. Although numerous methods and tools
are introduced to address these obstacles, it is still, so far, the case that large-scale
true integration is possible within only a few projects worldwide, which have sufficient
funding that allows all analyses to be carried out simultaneously and on the entire data
set. Intuitively, it seems that as a ‘gold standard’, integration attempts are best carried
out in supervised settings that are based on some priming biological knowledge or within the
frame of defined biological hypotheses. Combining additional layers in unsupervised analyses
might fail to contribute new information, as multiple use of the same data might
artificially reduce variance or increase the false discovery rate.

Conclusions
A more fundamental understanding of the biological dynamics of cancer will enable us to
better identify risk factors, refine cancer diagnosis, predict therapeutic effects and
prognosis, and identify new targets for therapy. We are seeing a paradigm shift from large
randomized clinical trials towards treatment modalities that are tailored for stratified
patient groups, down to N-of-1 trials, in which data from a single patient represents an
entire trial. This will fundamentally alter the way that we statistically model and evaluate
treatment strategies, from identifying patient groups that have a response to treatment that
is above random to identifying pathways and biological entities that are druggable and
altered above random; and from evaluating the response in randomized arms, using the other
arm as a control, to evaluating the response of experimental and control interventions in
each individual, using the same individual as a control. The real challenge

[Figure 4 graphic. Part a: heat map of sample consensus for Consensus clusters 1 to 4, with
annotation tracks for PAM50 (Basal, HER2-enriched, Luminal A, Luminal B, Normal-like) and
for Clinical/mutation data (Positive/mutant, Negative/WT, NA). Part b: membership rows, in
the order shown: miRNA 2, Methy 2, CN 2, PAM50 LumA, RPPA LumA, Methy 1, miRNA 6, RPPA
reactive I, CN 4, RPPA reactive II, miRNA 3, CN 1, Methy 4, miRNA 4, PAM50 HER2, RPPA HER2,
CN 5, Methy 3, PAM50 LumB, RPPA LumA/B, miRNA 7, miRNA 1, RPPA X. Part c: associations with
the consensus clusters: PAM50 P<0.001; ER P<0.001; PR P<0.001; HER2 P<0.001; T P<0.002;
N P<0.01; TP53 P<0.001; PIK3CA P<0.001; GATA3 P<0.001; MAP3K1 P<0.001; MAP2K4 P<0.02.]

Figure 4 | Classifying breast cancer using clustering of clusters. Consensus clustering (or
‘cluster of clusters’) of 348 breast cancer cases, based on data from five different genomic
and proteomic platforms. Consensus clustering analyses of the subtypes identifies four major
groups; the blue and white heat map displays sample consensus (part a). A heatmap display of
the subtypes defined independently by microRNAs (miRNAs), DNA methylation, copy number,
PAM50 mRNA expression, and reverse phase protein array (RPPA) expression; the red bar
indicates membership of a cluster type (part b). Associations with molecular and clinical
features, with P values from a chi-squared test are shown in part c. CN, copy number; ER,
oestrogen receptor; GATA3, GATA binding protein 3; HER2, human epidermal growth factor
receptor 2; LumA, luminal A; LumB, luminal B; MAP2K4, mitogen-activated protein kinase
kinase 4; MAP3K1, mitogen-activated protein kinase kinase kinase 1, E3 ubiquitin protein
ligase; Methy, methylation; N, node status; NA, not available; PAM50, gene expression
subtyping based on the PAM50 gene signature; PIK3CA, phosphatidylinositol-4,5-bisphosphate
3-kinase, catalytic subunit alpha; PR, progesterone receptor; T, tumour size; WT, wild type.
Figure is reproduced, with permission, from REF. 125 © (2012) Macmillan Publishers Ltd. All
rights reserved.
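The ‘cluster of clusters’ idea in Figure 4 starts from cluster labels that were assigned independently on each platform. As a rough sketch under stated assumptions, the code below encodes those labels as a 0/1 membership matrix (the kind of display described for part b) and then computes, for each pair of samples, the fraction of platforms on which they share a cluster. The five samples, their labels and the function names are hypothetical, and the co-membership fraction is a simplified stand-in for the consensus clustering used in the cited study, not a reimplementation of it.

```python
# Hypothetical platform-specific cluster labels for five samples.
PLATFORM_LABELS = {
    "miRNA": ["miRNA 2", "miRNA 2", "miRNA 4", "miRNA 4", "miRNA 2"],
    "Methy": ["Methy 2", "Methy 2", "Methy 4", "Methy 4", "Methy 4"],
    "CN": ["CN 2", "CN 2", "CN 1", "CN 1", "CN 1"],
}


def indicator_matrix(platform_labels):
    """One row per platform-specific cluster, one 0/1 entry per sample."""
    rows = {}
    for sample_labels in platform_labels.values():
        for cluster in sorted(set(sample_labels)):
            rows[cluster] = [1 if lab == cluster else 0 for lab in sample_labels]
    return rows


def co_membership(platform_labels):
    """Fraction of platforms on which each pair of samples falls in the same cluster."""
    n_platforms = len(platform_labels)
    n_samples = len(next(iter(platform_labels.values())))
    counts = [[0] * n_samples for _ in range(n_samples)]
    for sample_labels in platform_labels.values():
        for i in range(n_samples):
            for j in range(n_samples):
                if sample_labels[i] == sample_labels[j]:
                    counts[i][j] += 1
    return [[c / n_platforms for c in row] for row in counts]


for cluster, row in indicator_matrix(PLATFORM_LABELS).items():
    print(f"{cluster:8s}", row)
for row in co_membership(PLATFORM_LABELS):
    print([round(v, 2) for v in row])
```

In this toy input, samples 0 and 1 agree on every platform and samples 2 and 3 likewise, while sample 4 sides with the first pair on miRNA only. A second-level clustering of the indicator rows or of the co-membership matrix would then yield the integrated groups.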

would be to develop statistical models to identify crucial, rate-limiting molecular targets
for intervention, out of the wealth of information that next-generation sequencing uncovers,
against the background of great redundancy of pathways and heterogeneity of tumours. As we
are moving towards an era in which the amount of data produced every year is increasing
exponentially, the biomedical community needs to embrace this complexity and find new
methods of shared analysis. We need to learn from physicists

[Figure 5 graphic. Part a: heatmap with annotation tracks GI, IntClust, PAM50, ER, TP53,
Grade and NPI; labelled chromosomal regions 17q25, 17q12, 6p13–q24, 11q13/14, 8q24, 8p12 and
1q21–42; colour scale from Low to High. Legend categories: ER NEG, ER POS; Grade 1 to
Grade 3; NPI 2 to NPI 6; PAM50 Basal, HER2, Luminal A, Luminal B, Normal; TP53 wild type,
TP53 mutant, TP53 Null allele; IntClust 1 to IntClust 10. Part b: Discovery set,
Kaplan–Meier curves (x axis, Months, 0–150; y axis, Disease specific survival probability,
0.0–1.0); Logrank P = 1.2 × 10^–14.]

Cluster sizes in part b, with the number of deaths in parentheses:
IntClust1: 74 (18); IntClust2: 45 (20); IntClust3: 150 (19); IntClust4: 164 (32);
IntClust5: 91 (48); IntClust6: 44 (14); IntClust7: 109 (21); IntClust8: 140 (34);
IntClust9: 67 (24); IntClust10: 96 (30)

Figure 5 | Classifying breast cancer using integrative clustering. Integrative clustering of 997 breast cancer cases
from the METABRIC cohort, based on segmented copy number and gene expression for the top 1,000 cis-acting copy
number-expression associations. Heatmap showing the product of scaled gene expression and copy number values for
the selected features and for k = 10 clusters; columns represent breast cancer cases and rows represent features (part a).
Kaplan–Meier plot of disease-specific survival (truncated at 15 years) for the integrative subgroups. For each cluster, the
number of samples at risk is indicated as well as the total number of deaths in parentheses (part b). ER, oestrogen receptor;
ER NEG, ER negative; ER POS, ER positive; GI, genomic instability based on the proportion of genome altered (black line)
and jump measure (red line); grade, genomic grade; IntClust, groups found using integrative clustering with k = 10
clusters; NPI, Nottingham prognostic index; PAM50, gene expression subtyping based on the PAM50 gene signature.
Figure is reproduced, with permission, from REF. 49 © (2012) Macmillan Publishers Ltd. All rights reserved.
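
The two quantities named in the caption of Figure 5 can be illustrated with a minimal Python sketch. It is not the pipeline used in REF. 49: the integrative clustering itself is not reproduced, the caption does not state how the values were scaled (a per-feature z-score is assumed here), and the function names `zscore`, `concordance_matrix` and `kaplan_meier` are hypothetical names chosen for illustration. The first part computes the product of scaled gene expression and scaled copy number for each feature and sample (the values shown in part a); the second computes a Kaplan–Meier product-limit estimate of disease-specific survival with follow-up truncated at a chosen time (as in part b).

```python
from statistics import mean, pstdev


def zscore(values):
    """Scale one feature to zero mean and unit variance across samples.

    Assumption: z-scoring stands in for the unspecified scaling in the caption.
    """
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def concordance_matrix(expression, copy_number):
    """Element-wise product of scaled expression and scaled copy number.

    Both inputs are lists of rows (features) over the same samples (columns).
    Positive entries mean that expression and copy number deviate in the
    same direction for that feature and sample.
    """
    return [
        [e * c for e, c in zip(zscore(expr_row), zscore(cn_row))]
        for expr_row, cn_row in zip(expression, copy_number)
    ]


def kaplan_meier(times, events, truncate_at=None):
    """Product-limit estimate of survival; returns [(time, S(time)), ...].

    events[i] is True for a disease-specific death and False for censoring.
    Follow-up beyond truncate_at is censored at truncate_at.
    """
    if truncate_at is not None:
        events = [e and t <= truncate_at for t, e in zip(times, events)]
        times = [min(t, truncate_at) for t in times]
    survival, curve = 1.0, []
    # Step through the distinct death times in increasing order.
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

With follow-up recorded in months, calling `kaplan_meier(times, events, truncate_at=180)` once per subgroup would give one curve per cluster, truncated at 15 years; comparing the curves with a log-rank test is a separate step that is not shown here.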


and mathematicians and transform our way of working, thereby making data available on a hub so that everyone who is interested in it can work on it. New ideas can then be instantly picked up by anyone, rather than waiting for their publication. This was essential to the success of the DREAM BCC, which is one example of the many computational challenges that have been set by DREAM with the goal of catalysing the interaction between theory and experiment, specifically in the area of cellular network inference and quantitative model building in systems biology.

Another enormous challenge is the functional validation of the in silico findings in relevant living biological systems, as well as the development of adequate in vitro functional studies (such as small interfering RNA screens, knock-in systems and knockout systems) to keep up with the increasing throughput by which candidates for validation are generated. We still need to explore functions of thousands of candidate cancer genes and proteins to ascertain their value as risk factors, as predictive factors for therapy response and as therapeutic targets.

1. Hood, L., Heath, J. R., Phelps, M. E. & Lin, B. Systems biology and new technologies enable predictive and preventative medicine. Science 306, 640–643 (2004).
2. Ideker, T., Galitski, T. & Hood, L. A new approach to decoding life: systems biology. Annu. Rev. Genomics Hum. Genet. 2, 343–372 (2001).
3. Auffray, C. & Hood, L. Editorial: Systems biology and personalized medicine - the future is now. Biotechnol. J. 7, 938–939 (2012).
   This paper outlines the definitions and state of the art methodology in systems biology.
4. Tian, Q., Price, N. D. & Hood, L. Systems cancer medicine: towards realization of predictive, preventive, personalized and participatory (P4) medicine. J. Intern. Med. 271, 111–121 (2012).
5. Schadt, E. Eric Schadt. Interview by H. Craig Mak. Nature Biotech. 30, 769–770 (2012).
6. Joyce, A. R. & Palsson, B. Ø. The model organism as a system: integrating ‘omics’ data sets. Nature Rev. Mol. Cell Biol. 7, 198–210 (2006).
7. Martin, M. Semantic Web may be cancer information’s next step forward. J. Natl Cancer Inst. 103, 1215–1218 (2011).
8. Forbes, S. A. et al. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 39, D945–D950 (2011).
9. Cheung, H. W. et al. Systematic investigation of genetic vulnerabilities across cancer cell lines reveals lineage-specific dependencies in ovarian cancer. Proc. Natl Acad. Sci. USA 108, 12372–12377 (2011).
10. Martin, M. Rewriting the mathematics of tumor growth. J. Natl Cancer Inst. 103, 1564–1565 (2011).
11. Forbes, S. A. et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). Curr. Protoc. Hum. Genet. Chapter 10, Unit 10.11 (2008).
12. Stratton, M. R., Campbell, P. J. & Futreal, P. A. The cancer genome. Nature 458, 719–724 (2009).
13. International Cancer Genome Consortium. International network of cancer genome projects. Nature 464, 993–998 (2010).
   This is a description and the first results of the ICGC, a worldwide endeavour to characterize a wide range of tumours by next-generation sequencing.
14. The Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genet. 45, 1113–1120 (2013).
15. ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 9, e1001046 (2011).
   This is a genome-wide encyclopaedia of structural and regulatory elements in the genome.
16. Quigley, D. A. et al. The 5p12 breast cancer susceptibility locus affects MRPS30 expression in estrogen-receptor positive tumors. Mol. Oncol. 8, 273–284 (2013).
17. Fletcher, M. N. C. et al. Master regulators of FGFR2 signalling and breast cancer risk. Nature Commun. 4, 2464 (2013).
18. Brower, V. Epigenetics: Unravelling the cancer code. Nature 471, S12–13 (2011).
19. Chin, L., Andersen, J. N. & Futreal, P. A. Cancer genomics: from discovery science to personalized medicine. Nature Med. 17, 297–303 (2011).
20. Yuan, Y. et al. Quantitative image analysis of cellular heterogeneity in breast tumors complements genomic profiling. Sci. Transl. Med. 4, 157ra143 (2012).
21. Kumar, V. et al. Radiomics: the process and the challenges. Magn. Reson. Imaging 30, 1234–1248 (2012).
22. Kilpinen, S. et al. Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues. Genome Biol. 9, R139 (2008).
23. Wong, A. K. et al. IMP: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res. 40, W484–W490 (2012).
24. Engreitz, J. M., Daigle, B. J., Marshall, J. J. & Altman, R. B. Independent component analysis: mining microarray data for fundamental human gene expression modules. J. Biomed. Inform. 43, 932–944 (2010).
25. Engreitz, J. M. et al. ProfileChaser: searching microarray repositories based on genome-wide patterns of differential expression. Bioinformatics 27, 3317–3318 (2011).
26. Rhodes, D. R. et al. ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia 6, 1–6 (2004).
27. Madhavan, S. et al. Rembrandt: helping personalized medicine become a reality through integrative translational research. Mol. Cancer Res. 7, 157–167 (2009).
   This paper describes integrated genomic analyses in medicine.
28. Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).
29. Saito, R. et al. A travel guide to Cytoscape plugins. Nature Methods 9, 1069–1076 (2012).
30. Cline, M. S. et al. Integration of biological networks and gene expression data using Cytoscape. Nature Protocol. 2, 2366–2382 (2007).
   This paper describes a widely used space for genomic analysis and visualization.
31. Gundem, G. et al. IntOGen: integration and data mining of multidimensional oncogenomic data. Nature Methods 7, 92–93 (2010).
32. Gonzalez-Perez, A. & López-Bigas, N. Functional impact bias reveals cancer drivers. Nucleic Acids Res. 40, e169 (2012).
33. Margolin, A. A. et al. Systematic analysis of challenge-driven improvements in molecular prognostic models for breast cancer. Sci. Transl. Med. 5, 181re1 (2013).
34. Schadt, E. E., Linderman, M. D., Sorenson, J., Lee, L. & Nolan, G. P. Computational solutions to large-scale data management and analysis. Nature Rev. Genet. 11, 647–657 (2010).
35. Quigley, D. & Balmain, A. Systems genetics analysis of cancer susceptibility: from mouse models to humans. Nature Rev. Genet. 10, 651–657 (2009).
36. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
   This paper describes an integration of next-generation sequencing data from DNA and RNA levels that reveals the structure of many regulatory elements.
37. Chin, K. et al. Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell 10, 529–541 (2006).
38. Lando, M. et al. Gene dosage, expression, and ontology analysis identifies driver genes in the carcinogenesis and chemoradioresistance of cervical cancer. PLoS Genet. 5, e1000719 (2009).
39. Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899–905 (2010).
40. Sun, Z. et al. Integrated analysis of gene expression, CpG island methylation, and gene copy number in breast cancer cells by deep sequencing. PLoS ONE 6, e17490 (2011).
41. Ovaska, K. et al. Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme. Genome Med. 2, 65 (2010).
42. Aure, M. R. et al. Identifying in-trans process associated genes in breast cancer by integrated analysis of copy number and expression data. PLoS ONE 8, e53014 (2013).
43. Chibon, F. et al. Validated prediction of clinical outcome in sarcomas and multiple types of cancer on the basis of a gene expression signature related to genome complexity. Nature Med. 16, 781–787 (2010).
44. Chari, R., Coe, B. P., Vucic, E. A., Lockwood, W. W. & Lam, W. L. An integrative multi-dimensional genetic and epigenetic strategy to identify aberrant genes and pathways in cancer. BMC Syst. Biol. 4, 67 (2010).
45. Louhimo, R. & Hautaniemi, S. CNAmet: an R package for integrating copy number, methylation and expression data. Bioinformatics 27, 887–888 (2011).
46. R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/
47. Shen, Y., Sun, W. & Li, K.-C. Dynamically weighted clustering with noise set. Bioinformatics 26, 341–347 (2010).
48. Shen, R. et al. Integrative subtype discovery in glioblastoma using iCluster. PLoS ONE 7, e35236 (2012).
49. Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
50. Yuan, Y., Savage, R. S. & Markowetz, F. Patient-specific data fusion defines prognostic cancer subtypes. PLoS Comput. Biol. 7, e1002227 (2011).
51. Bøvelstad, H. M. et al. Predicting survival from microarray data – a comparative study. Bioinformatics 23, 2080–2087 (2007).
52. Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Statist. Soc. Series B 58, 267–288 (1996).
53. Nowak, G., Hastie, T., Pollack, J. R. & Tibshirani, R. A fused lasso latent feature model for analyzing multi-sample aCGH data. Biostatistics 12, 776–791 (2011).
54. Mankoo, P. K., Shen, R., Schultz, N., Levine, D. A. & Sander, C. Time to recurrence and survival in serous ovarian tumors predicted from integrated genomic profiles. PLoS ONE 6, e24709 (2011).
55. Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Statist. Soc.: Series B (Statist. Methodol.) 67, 301–320 (2005).
56. Segal, E., Friedman, N., Koller, D. & Regev, A. A module map showing conditional activity of expression modules in cancer. Nature Genet. 36, 1090–1098 (2004).
   This landmark publication establishes the principles of identification of regulatory modules.
57. Kelder, T. et al. WikiPathways: building research communities on biological pathways. Nucleic Acids Res. 40, D1301–D1307 (2012).


58. Rhee, S. Y., Wood, V., Dolinski, K. & Draghici, S. Use and misuse of the gene ontology annotations. Nature Rev. Genet. 9, 509–515 (2008).
59. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
60. Dittrich, M. T., Klau, G. W., Rosenwald, A., Dandekar, T. & Müller, T. Identifying functional modules in protein-protein interaction networks: an integrated exact approach. Bioinformatics 24, i223–i231 (2008).
61. Qiu, Y.-Q., Zhang, S., Zhang, X.-S. & Chen, L. Detecting disease associated modules and prioritizing active genes based on high throughput data. BMC Bioinformatics 11, 26 (2010).
62. Guo, Z. et al. Edge-based scoring and searching method for identifying condition-responsive protein-protein interaction sub-network. Bioinformatics 23, 2121–2128 (2007).
63. Chuang, H.-Y. et al. Subnetwork-based analysis of chronic lymphocytic leukemia identifies pathways that associate with disease progression. Blood 120, 2639–2649 (2012).
64. Doniger, S. W. et al. MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol. 4, R7 (2003).
65. Tarca, A. L. et al. A novel signaling pathway impact analysis. Bioinformatics 25, 75–82 (2009).
66. Efroni, S., Schaefer, C. F. & Buetow, K. H. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS ONE 2, e425 (2007).
67. Drier, Y., Sheffer, M. & Domany, E. Pathway-based personalized analysis of cancer. Proc. Natl Acad. Sci. USA 110, 6388–6393 (2013).
68. Huttenhower, C. et al. Detailing regulatory networks through large scale data integration. Bioinformatics 25, 3267–3274 (2009).
69. Huttenhower, C. et al. Exploring the human genome with functional maps. Genome Res. 19, 1093–1106 (2009).
70. Mayer, C.-D., Lorent, J. & Horgan, G. W. Exploratory analysis of multiple omics datasets using the adjusted RV coefficient. Stat. Appl. Genet. Mol. Biol. 10, Article 14 (2011).
71. Quigley, D. A. et al. Genetic architecture of mouse skin inflammation and tumour susceptibility. Nature 458, 505–508 (2009).
72. Lê Cao, K.-A., González, I. & Déjean, S. integrOmics: an R package to unravel relationships between two omics datasets. Bioinformatics 25, 2855–2856 (2009).
73. Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).
74. Margolin, A. A., Wang, K., Califano, A. & Nemenman, I. Multivariate dependence and genetic networks inference. IET Syst. Biol. 4, 428–440 (2010).
75. Margolin, A. A. & Califano, A. Theory and limitations of genetic network inference from microarray data. Ann. NY Acad. Sci. 1115, 51–72 (2007).
76. Koller, D. & Friedman, N. Probabilistic graphical models: principles and techniques (Massachusetts Institute of Technology, 2009).
   This study describes one of the basic approaches for studying gene–gene dependencies.
77. Califano, A., Butte, A. J., Friend, S., Ideker, T. & Schadt, E. Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nature Genet. 44, 841–847 (2012).
   This paper describes a fundamental attempt to identify genotype–phenotype interactions.
78. Ideker, T., Ozier, O., Schwikowski, B. & Siegel, A. F. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18 (Suppl. 1), S233–240 (2002).
79. Breitling, R., Amtmann, A. & Herzyk, P. Graph-based iterative Group Analysis enhances microarray interpretation. BMC Bioinformatics 5, 100 (2004).
80. Ideker, T. & Krogan, N. J. Differential network biology. Mol. Syst. Biol. 8, 565 (2012).
81. Stingo, F. C. & Vannucci, M. Variable selection for discriminant analysis with Markov random field priors for the analysis of microarray data. Bioinformatics 27, 495–501 (2011).
82. Bauer, S., Gagneur, J. & Robinson, P. N. GOing Bayesian: model-based gene set analysis of genome-scale data. Nucleic Acids Res. 38, 3523–3532 (2010).
83. Newton, M. A., He, Q. & Kendziorski, C. A model-based analysis to infer the functional content of a gene list. Stat. Appl. Genet. Mol. Biol. 11, http://dx.doi.org/10.2202/1544-6115.1716 (2012).
84. Segal, E. et al. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genet. 34, 166–176 (2003).
85. Segal, E., Friedman, N., Kaminski, N., Regev, A. & Koller, D. From signatures to models: understanding cancer using microarrays. Nature Genet. 37, S38–S45 (2005).
86. Vaske, C. J. et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, i237–i245 (2010).
   This paper describes an application of approaches from the probabilistic graphical models in the identification of pathways or dependencies deviating from a given norm.
87. Kristensen, V. N. et al. Integrated molecular profiles of invasive breast tumors and ductal carcinoma in situ (DCIS) reveal differential vascular and interleukin signaling. Proc. Natl Acad. Sci. USA 109, 2802–2807 (2012).
88. Ferkingstad, E., Frigessi, A. & Lyng, H. Indirect genomic effects on survival from gene expression data. Genome Biol. 9, R58 (2008).
89. Imoto, S. et al. Combining microarrays and biological knowledge for estimating gene networks via bayesian networks. J. Bioinform. Comput. Biol. 2, 77–98 (2004).
90. Bottolo, L. et al. Bayesian detection of expression quantitative trait loci hot spots. Genetics 189, 1449–1459 (2011).
91. Akavia, U. D. et al. An integrated approach to uncover drivers of cancer. Cell 143, 1005–1017 (2010).
92. Birtwistle, M. R. et al. Ligand-dependent responses of the ErbB signaling network: experimental and modeling analyses. Mol. Syst. Biol. 3, 144 (2007).
93. Nik-Zainal, S. A. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012).
94. Shah, S. P. et al. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 486, 395–399 (2012).
95. Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455, 1061–1068 (2008).
96. Ciriello, G. et al. Emerging landscape of oncogenic signatures across human cancers. Nature Genet. 45, 1127–1133 (2013).
97. Zack, T. I. et al. Pan-cancer patterns of somatic copy number alteration. Nature Genet. 45, 1134–1140 (2013).
98. Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genet. 45, 1113–1120 (2013).
99. Newman, M. E. J. Fast algorithm for detecting community structure in networks. Phys. Rev. E Stat. Nonlin Soft Matter Phys. 69, 066133 (2004).
100. Louhimo, R., Lepikhova, T., Monni, O. & Hautaniemi, S. Comparative analysis of algorithms for integration of copy number and expression data. Nature Methods 9, 351–355 (2012).
101. Solvang, H. K., Lingjærde, O. C., Frigessi, A., Børresen-Dale, A.-L. & Kristensen, V. N. Linear and non-linear dependencies between copy number aberrations and mRNA expression reveal distinct molecular pathways in breast cancer. BMC Bioinformatics 12, 197 (2011).
102. Heiser, L. M. et al. Subtype and pathway specific responses to anticancer compounds in breast cancer. Proc. Natl Acad. Sci. USA 109, 2724–2729 (2012).
103. Hoshino, D. et al. Network analysis of the focal adhesion to invadopodia transition identifies a PI3K-PKCα invasive signaling axis. Sci. Signal. 5, ra66 (2012).
104. Stronach, E. A. et al. DNA-PK mediates AKT activation and apoptosis inhibition in clinically acquired platinum resistance. Neoplasia 13, 1069–1080 (2011).
105. Mok, T. S. et al. Gefitinib or carboplatin-paclitaxel in pulmonary adenocarcinoma. N. Engl. J. Med. 361, 947–957 (2009).
106. Shepherd, F. A. et al. Erlotinib in previously treated non-small-cell lung cancer. N. Engl. J. Med. 353, 123–132 (2005).
107. Piccart-Gebhart, M. J. et al. Trastuzumab after adjuvant chemotherapy in HER2-positive breast cancer. N. Engl. J. Med. 353, 1659–1672 (2005).
108. Romond, E. H. et al. Trastuzumab plus adjuvant chemotherapy for operable HER2-positive breast cancer. N. Engl. J. Med. 353, 1673–1684 (2005).
109. Chapman, P. B. et al. Improved survival with vemurafenib in melanoma with BRAF V600E mutation. N. Engl. J. Med. 364, 2507–2516 (2011).
110. Jonker, D. J. et al. Cetuximab for the treatment of colorectal cancer. N. Engl. J. Med. 357, 2040–2048 (2007).
111. Karapetis, C. S. et al. K-ras mutations and benefit from cetuximab in advanced colorectal cancer. N. Engl. J. Med. 359, 1757–1765 (2008).
112. Iadevaia, S., Lu, Y., Morales, F. C., Mills, G. B. & Ram, P. T. Identification of optimal drug combinations targeting cellular networks: integrating phospho-proteomics and computational network analysis. Cancer Res. 70, 6704–6714 (2010).
113. van de Vijver, M. J. et al. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med. 347, 1999–2009 (2002).
114. Paik, S. et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N. Engl. J. Med. 351, 2817–2826 (2004).
115. Cooper, S. et al. Predicting protein structures with a multiplayer online game. Nature 466, 756–760 (2010).
116. Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nature Methods 10, 221–227 (2013).
117. Cheng, W.-Y., Ou Yang, T.-H. & Anastassiou, D. Biomolecular events in cancer revealed by attractor metagenes. PLoS Comput. Biol. 9, e1002920 (2013).
118. Cheng, W.-Y., Ou Yang, T.-H. & Anastassiou, D. Development of a prognostic model for breast cancer survival in an open challenge environment. Sci. Transl. Med. 5, 181ra50 (2013).
119. Perou, C. M. et al. Molecular portraits of human breast tumours. Nature 406, 747–752 (2000).
120. Sørlie, T. et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl Acad. Sci. USA 98, 10869–10874 (2001).
121. Sørlie, T. et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl Acad. Sci. USA 100, 8418–8423 (2003).
122. Russnes, H. G. et al. Genomic architecture characterizes tumor progression paths and fate in breast cancer patients. Sci. Transl. Med. 2, 38ra47 (2010).
123. Chin, S.-F. et al. Using array-comparative genomic hybridization to define molecular portraits of primary breast cancers. Oncogene 26, 1959–1970 (2007).
124. Stephens, P. J. et al. The landscape of cancer genes and mutational processes in breast cancer. Nature 486, 400–404 (2012).
125. Cancer Genome Atlas Research Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
126. Naume, B. et al. Presence of bone marrow micrometastasis is associated with different recurrence risk within molecular subtypes of breast cancer. Mol. Oncol. 1, 160–171 (2007).
127. Nordgard, S. H. et al. Genome-wide analysis identifies 16q deletion associated with survival, molecular subtypes, mRNA expression, and germline haplotypes in breast cancer patients. Genes Chromosomes Cancer 47, 680–696 (2008).
128. Rønneberg, J. A. et al. Methylation profiling with a panel of cancer related genes: association with estrogen receptor, TP53 mutation status and expression subtypes in sporadic breast cancer. Mol. Oncol. 5, 61–76 (2011).
129. Enerly, E. et al. miRNA-mRNA integrated analysis reveals roles for miRNAs in primary breast tumors. PLoS ONE 6, e16915 (2011).
130. Joshi, H., Bhanot, G., Børresen-Dale, A.-L. & Kristensen, V. N. Potential tumorigenic programs associated with TP53 mutation status reveal role of VEGF pathway. Br. J. Cancer 107, 1722–1728 (2012).


131. Stephens, P. J. et al. Complex landscapes of somatic rearrangement in human breast cancer genomes. Nature 462, 1005–1010 (2009).
132. Sun, Z. et al. Batch effect correction for genome-wide methylation data with Illumina Infinium platform. BMC Med. Genom. 4, 84 (2011).
133. Strehl, A. & Ghosh, J. Cluster ensembles – a knowledge reuse framework for combining partitionings. Journal of Machine Learning Research 3, 583–617 (2002).
134. Monti, S., Tamayo, P., Mesirov, J. & Golub, T. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learn. 52, 91–118 (2003).
135. Collisson, E. A. et al. Subtypes of pancreatic ductal adenocarcinoma and their differing responses to therapy. Nature Med. 17, 500–503 (2011).
136. Lancichinetti, A. & Fortunato, S. Consensus clustering in complex networks. Sci. Rep. 2, 336 (2012).
137. Lee, M. & Kim, Y. CHESS (CgHExpreSS): a comprehensive analysis tool for the analysis of genomic alterations and their effects on the expression profile of the genome. BMC Bioinformatics 10, 424 (2009).
138. Shen, R., Olshen, A. B. & Ladanyi, M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25, 2906–2912 (2009).
139. Leday, G. G. R. & van de Wiel, M. A. PLRS: a flexible tool for the joint analysis of DNA copy number and mRNA expression data. Bioinformatics 29, 1081–1082 (2013).
140. Chen, B.-J. et al. Harnessing gene expression to identify the genetic basis of drug resistance. Mol. Syst. Biol. 5, 310 (2009).
141. Yuan, Y., Curtis, C., Caldas, C. & Markowetz, F. A sparse regulatory network of copy-number driven gene expression reveals putative breast cancer oncogenes. IEEE/ACM Trans. Comput. Biol. Bioinform. 9, 947–954 (2012).
142. Carro, M. S. et al. The transcriptional network for mesenchymal transformation of brain tumours. Nature 463, 318–325 (2010).
143. Saadi, A. et al. Stromal genes discriminate preinvasive from invasive disease, predict outcome, and highlight inflammatory pathways in digestive cancers. Proc. Natl Acad. Sci. USA 107, 2177–2182 (2010).
144. Hamatani, T. et al. Global gene expression analysis identifies molecular pathways distinguishing blastocyst dormancy and activation. Proc. Natl Acad. Sci. USA 101, 10326–10331 (2004).
145. Draghici, S. et al. A systems biology approach for pathway level analysis. Genome Res. 17, 1537–1545 (2007).
146. Engström, P. G. et al. Digital transcriptome profiling of normal and glioblastoma-derived neural stem cells identifies genes associated with patient survival. Genome Med. 4, 76 (2012).
147. Wu, J., Mao, X., Cai, T., Luo, J. & Wei, L. KOBAS server: a web-based platform for automated annotation and pathway identification. Nucleic Acids Res. 34, W720–W724 (2006).
148. Xie, C. et al. KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic Acids Res. 39, W316–W322 (2011).
149. Li, C. et al. SubpathwayMiner: a software package for flexible identification of pathways. Nucleic Acids Res. 37, e131 (2009).
150. Chang, H.-T. et al. Comprehensive analysis of microRNAs in breast cancer. BMC Genomics 13, S18 (2012).
151. Tamborero, D., Lopez-Bigas, N. & Gonzalez-Perez, A. Oncodrive-CIS: a method to reveal likely driver genes based on the impact of their copy number changes on expression. PLoS ONE 8, e55489 (2013).
152. Warsow, G. et al. ExprEssence – revealing the essence of differential experimental data in the context of an interaction/regulation network. BMC Syst. Biol. 4, 164 (2010).
153. Deshpande, R., Sharma, S., Verfaillie, C. M., Hu, W.-S. & Myers, C. L. A scalable approach for discovering conserved active subnetworks across species. PLoS Comput. Biol. 6, e1001028 (2010).
154. Goffard, N., Frickey, T. & Weiller, G. PathExpress update: the enzyme neighbourhood method of associating gene-expression data with metabolic pathways. Nucleic Acids Res. 37, W335–W339 (2009).
155. Bryant, W. A., Sternberg, M. J. E. & Pinney, J. W. AMBIENT: Active Modules for Bipartite Networks – using high-throughput transcriptomic data to dissect metabolic response. BMC Syst. Biol. 7, 26 (2013).
156. Kirk, P., Griffin, J. E., Savage, R. S., Ghahramani, Z. & Wild, D. L. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics 28, 3290–3297 (2012).
157. Brodtkorb, M. et al. Whole-genome integrative analysis reveals expression signatures predicting transformation in follicular lymphoma. Blood 123, 1051–1054 (2014).

Acknowledgements
The authors thank numerous collaborators, most notably D. Quigley, R. Sachidanandam, S. Hautaniemi, P. van Loo and C. Vaske for the critical reading of the manuscript and for sharing their overview of the field and valuable discussions. Special thanks to C. Perou and C. Creighton of The Cancer Genome Atlas (TCGA) and O. Rueda and C. Caldas of the METABRIC study, as well as M. M. Holmen, from Oslo University Hospital for providing original images. The authors also thank the Norwegian Cancer Society, the K.G. Jebsen Foundation, the Norwegian Research Council, Health Region South East, and the Norwegian Radium Hospital’s Foundation for financial support over many years.

Competing interests statement
The authors declare no competing interests.

FURTHER INFORMATION

Databases and sites for integrating tools:
Cancer Genome Project: www.sanger.ac.uk/genetics/CGP/
Catalogue of Somatic Mutations in Cancer (COSMIC) database: http://www.sanger.ac.uk/genetics/CGP/cosmic/
ENCyclopedia Of DNA Elements (ENCODE): http://genome.ucsc.edu/ENCODE/
International Cancer Genome Consortium (ICGC): www.icgc.org/
NCI/TCGA: http://cancergenome.nih.gov
The Cancer Genome Atlas (TCGA): www.cancergenome.nih.gov

Storage and compute spaces:
Bioconductor: http://www.bioconductor.org/
Bionimbus: http://www.bionimbus.org/
Cytoscape: http://www.cytoscape.org/
Federation of SAGE: http://sagebase.org/
Synapse: https://synapse.prod.sagebase.org/

Protein–protein interactions:
HPRD: www.hprd.org/
Kyoto Encyclopedia of Genes and Genomes (KEGG): www.genome.jp/kegg
MIPS (Mammalian protein–protein interaction): http://mips.helmholtz-muenchen.de/proj/ppi/
PID Pathway Interaction Database (NCI): www.pid.nci.nih.gov
Reactome: www.reactome.org
WikiPathways: http://wikipathways.org/

Annotation, visualization and integrated discovery:
Bioweaver: http://sonorus.princeton.edu/bioweaver/
DAVID: http://david.abcc.ncifcrf.gov
GOLEM: http://reducio.princeton.edu/GOLEM/
GRIFn: http://reducio.princeton.edu/GRIFn/
HEFalMp: http://hefalmp.princeton.edu/
Mefit: http://avis.princeton.edu/mefit
MSigDB Molecular Signatures Database: www.broadinstitute.org/gsea/msigdb/index.jsp
Oncomine: https://www.oncomine.org/resource/login.html
Rembrandt: http://cabig.cancer.gov/action/collaborations/rembrandt/
Search-Based Exploration of Expression Compendium (SEEK): http://seek.princeton.edu
Sleipnir: http://libsleipnir.bitbucket.org/
Summary of gene ontology tools: http://www.geneontology.org/GO.tools.microarray.shtml

Omics integration:
Combinatorial ALgorithm for Expression and Sequence-based Cluster Extraction (COALESCE): http://reducio.princeton.edu/cm/coalesce
COpy Number and EXpression In Cancer (CONNEXIC): http://www.c2b2.columbia.edu/danapeerlab/html/software.html
DR-Integrator: http://pollacklab.stanford.edu/
IntOGen: http://bg.upf.edu/group/tools.php#intogen
Magellan: http://cabig.nci.nih.gov/
OncoDrive: http://bg.upf.edu/blog/tag/oncodrive/
PAthway Recognition Algorithm using Data Integration on Genomic Models (PARADIGM): http://sbenz.github.com/Paradigm
