NGS
NGS
NGS
Data Analysis
Xinkun Wang
Next-Generation Sequencing
Data Analysis
Next-Generation Sequencing
Data Analysis
Xinkun Wang
Northwestern University
Chicago, Illinois, USA
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2016 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copy-
right.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that pro-
vides licenses and registration for a variety of users. For organizations that have been granted a photo-
copy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
v
vi Contents
7. Transcriptomics by RNA-Seq..................................................................... 97
7.1 Principle of RNA-Seq........................................................................... 97
7.2 Experimental Design........................................................................... 98
7.2.1 Factorial Design...................................................................... 98
7.2.2 Replication and Randomization........................................... 98
7.2.3 Sample Preparation................................................................ 99
7.2.4 Sequencing Strategy............................................................. 100
7.3 RNA-Seq Data Analysis.................................................................... 101
7.3.1 Data Quality Control and Reads Mapping....................... 101
7.3.2 RNA-Seq Data Normalization............................................ 103
7.3.3 Identification of Differentially Expressed Genes............. 105
7.3.4 Differential Splicing Analysis............................................. 107
7.3.5 Visualization of RNA-Seq Data.......................................... 108
7.3.6 Functional Analysis of Identified Genes........................... 108
7.4 RNA-Seq as a Discovery Tool........................................................... 109
Introduction to Cellular
and Molecular Biology
1
The Cellular System and the Code of Life
3
4 Next-Generation Sequencing Data Analysis
1.3 Molecules in Cells
Different types of molecules are needed to carry out the various cellular pro-
cesses. In a typical cell, water is the most abundant, representing 70% of the
total cell weight. Besides water, there is a large variety of small and large
The Cellular System and the Code of Life 5
1.4.1 Nucleus
Since DNA stores the code of life, it must be protected and properly main-
tained to avoid possible damage, and ensure accuracy and stability. As
proper execution of the genetic information embedded in the DNA is critical
6 Next-Generation Sequencing Data Analysis
Nucleus
Nuclear envelope
(with nuclear pores)
Chromatin Cell membrane
Peroxisome
Nucleolus Ribosome
Microtubule
Lysosome
Mitochondrion
Golgi apparatus
Rough ER
Smooth ER
Intermediate
filament
Centrosome
Cytoplasm Endosome
Microfilament
FIGURE 1.1
The general structure of a typical eukaryotic cell. Shown here is an animal cell.
to the normal functioning of a cell, gene expression must also be tightly reg-
ulated under all conditions. The nucleus, located in the center of most cells
in eukaryotes, offers a well-protected environment for DNA storage, main-
tenance, and gene expression. The nuclear space is enclosed by a nuclear
envelope consisting of two concentric membranes. To allow movement of
proteins and RNAs across the nuclear envelope, which is essential for gene
expression, there are pores on the nuclear envelope that span the inner and
outer membrane. The mechanical support of the nucleus is provided by the
nucleoskeleton, a network of structural proteins called lamins. Inside the
nucleus, long strings of DNA molecules, through binding to certain proteins
called histones, are heavily packed to fit into the limited nuclear space. In
prokaryotic cells, a nucleus-like irregularly shaped region that does not have
a membrane enclosure called the nucleoid, provides a similar but not as well-
protected space for DNA.
1.4.2 Cell Membrane
The cell membrane serves as a barrier to protect the internal structure of a
cell from the outside environment. Biochemically, the cell membrane, as well
as all other intracellular membranes such as the nuclear envelope, assumes a
The Cellular System and the Code of Life 7
1.4.3 Cytoplasm
Inside the cell membrane, cytoplasm is the thick solution that contains the
majority of cellular substances, including all organelles in eukaryotic cells
but excluding the nucleus in eukaryotic cells and the DNA in prokaryotic
cells. The general fluid component of the cytoplasm that excludes the organ-
elles is called the cytosol. The cytosol makes up more than half of the cel-
lular volume and is where many cellular activities take place, including a
large number of metabolic steps such as glycolysis and interconversion of
molecules and most signal transduction steps. In prokaryotic cells, due to
the lack of a nucleus and other specialized organelles, the cytosol is almost
the entire intracellular space and where most cellular activities take place.
Besides water, the cytosol contains large amounts of small and large
molecules. Small molecules, such as inorganic ions, provide an overall bio-
chemical environment for cellular activities. In addition, ions such as Na+,
K+, and Ca2+ also have substantial concentration differences between the
cytosol and the extracellular space. Cells spend a lot of energy maintaining
these concentration differences, and use them for signaling and metabolic
purposes. For example, the concentration of Ca2+ in the cytosol is normally
kept very low at ~10 –7 M, whereas in the extracellular space it is ~10 –3 M.
The rushing in of Ca2+ under certain conditions through ligand- or voltage-
gated channels serves as an important messenger, inducing responses in a
number of signaling pathways, some of which lead to altered gene expres-
sion. Besides small molecules, the cytosol also contains large numbers of
macromolecules. Far from being simply randomly diffusing in the cytosol,
8 Next-Generation Sequencing Data Analysis
in the peroxisome but without generating any energy, thereby rendering the
peroxisome largely irrelevant except for carrying out the remnant oxidative
functions.
1.4.5 Ribosome
Ribosome is the protein assembly factory in cells, translating genetic infor-
mation carried in messenger RNAs (mRNAs) into proteins. There are vast
numbers of ribosomes, usually from thousands to millions, in a typical cell.
Whereas both prokaryotic and eukaryotic ribosomes are composed of two
components (or subunits), eukaryotic ribosomes are larger than their pro-
karyotic counterparts. In eukaryotic cells, the two ribosomal subunits are
first assembled inside the nucleus in a region called the nucleolus and then
shipped out to the cytoplasm. In the cytoplasm, ribosomes can be either
free or get attached to another organelle (the ER). Biochemically, ribosomes
contain more than 50 proteins and several ribosomal RNA (rRNA) species.
Because ribosomes are highly abundant in cells, rRNAs are the most abun-
dant in total RNA extracts, accounting for 85% to 90% of all RNA species.
For profiling cellular RNA populations using next-generation sequencing
(NGS), rRNAs are usually not of interest despite their abundance and there-
fore need to be depleted to avoid generation of overwhelming amounts of
sequencing reads from them.
1.4.7 Golgi Apparatus
Besides the ER, the Golgi apparatus also plays an indispensable role in sort-
ing as well as dispatching proteins to the cell membrane, extracellular space,
or other subcellular destinations. Many proteins synthesized in the ER are
sent to the Golgi apparatus via small vesicles for further processing before
being sent to their final destinations. Therefore, the Golgi apparatus is some-
times metaphorically described as the “post office” of the cell. The process-
ing carried out in this organelle includes chemical modification of some of
the proteins, such as adding oligosaccharide side chains, which serve as
“address labels.” Other important functions of the Golgi apparatus include
synthesizing carbohydrates and extracellular matrix materials, such as the
polysaccharide for the building of the plant cell wall.
1.4.8 Cytoskeleton
Cellular processes like the trafficking of proteins in vesicles from the ER to
the Golgi apparatus or the movement of a mitochondrion from one intracellu-
lar location to another are not simply based on diffusion. Rather, they follow
a certain protein-made skeletal structure inside the cytosol, that is, the cyto-
skeleton, as tracks. Besides providing tracks for intracellular transport, the
cytoskeleton, like the skeleton in the human body, plays an equally impor-
tant role in maintaining cell shape and protecting the cell framework from
physical stresses, as the lipid bilayer cell membrane is fragile and vulnerable
to such stresses. In eukaryotic cells, there are three major types of cytoskel-
etal structures: microfilament, microtubule, and intermediate filament. Each
type is made of distinct proteins and has its own unique characteristics and
functions. For example, microfilament and microtubule are assembled from
actins and tubulins, respectively, and have different thicknesses (the diam-
eter is about 6 nm for microfilament and 23 nm for microtubule). Although
biochemically and structurally different, both the microfilament and the
microtubule have been known to provide tracks for mRNA transport in the
form of large ribonucleoprotein complexes to specific intracellular sites, such
as the distal end of a neuronal dendrite, for targeted protein translation [3,4].
Besides its role in intracellular transportation, the microtubule also plays a
key role in cell division through attaching to the duplicated chromosomes
and moving them equally into two daughter cells. In this process, all micro-
tubules involved are organized around a small organelle called centrosome.
Previously thought to be only present in eukaryotic cells, cytoskeletal struc-
tures have also been discovered in prokaryotic cells [5].
1.4.9 Mitochondrion
The mitochondrion is the “powerhouse” in eukaryotic cells. While some
energy is produced from the glycolytic pathway in the cytosol, most energy
The Cellular System and the Code of Life 11
is generated from the Krebs cycle and the oxidative phosphorylation process
that take place in the many mitochondria contained in a cell. The number of
mitochondria in a cell is ultimately dependent on its energy demand. The
more energy a cell needs, the more mitochondria it has. Structurally, the mito-
chondrion is an organelle enclosed by two membranes. The outer membrane
is highly permeable to most cytosolic molecules, and as a result the inter-
membrane space between the outer and inner membranes is similar to the
cytosol. Most of the energy-releasing process occurs in the inner membrane
and in the matrix, that is, the space enclosed by the inner membrane. For
the energy release, high-energy electron carriers generated from the Krebs
cycle in the matrix are fed into an electron transport chain embedded in the
inner membrane. The energy released from the transfer of high-energy elec-
trons through the chain to molecular oxygen (O2), the final electron acceptor,
creates a proton gradient across the inner membrane. This proton gradient
serves as the energy source for the synthesis of ATP, the universal energy
currency in cells. In prokaryotic cells, since they do not have this organelle,
ATP synthesis takes place on their cytoplasmic membrane instead.
The origin of the mitochondrion, based on the widely accepted endosym-
biotic theory, is an ancient α-proteobacteria. So not surprisingly, the mito-
chondrion carries its own DNA, but the genetic information contained in
mitochondrial DNA (mtDNA) is extremely limited compared to nuclear
DNA. Human mitochondrial DNA, for example, is 16,569 bp in size coding
for 37 genes, including 22 for transfer RNAs (tRNAs), 2 for rRNAs, and 13 for
mitochondrial proteins. Although it is much smaller compared to the nuclear
genome, there are multiple copies of mtDNA molecules in each mitochon-
drion. Since cells usually contain hundreds to thousands of mitochondria,
there are a large number of mtDNA molecules in each cell. In comparison,
most cells only contain two copies of the nuclear DNA. As a result, when
sequencing cellular DNA samples, sequences derived from mitochondrial
DNA usually comprise a notable, sometimes substantial, percentage of total
generated reads. Although small, the mitochondrial genomic system is fully
functional and has the entire set of protein factors for mtDNA transcription,
translation, and replication. As a result of its activity, when cellular RNA
molecules are sequenced, those transcribed from the mitochondrial genome
also generate significant amounts of reads in the sequence output.
The many copies of mtDNA molecules in a cell may not all have the same
sequence due to mutations in individual molecules. Heteroplasmy occurs
when cells contain a heterogeneous set of mtDNA molecules. In general, mito-
chondrial DNA has a higher mutation rate than its nuclear counterpart. This
is because the transfer of high-energy electrons along the electron transport
chain can produce reactive oxygen species as byproducts, which can oxidize
and cause mutations in mtDNA. To make this situation even worse, the DNA
repair capability in mitochondria is rather limited. Increased heteroplasmy
has been associated with a higher risk of developing aging-related diseases,
including Alzheimer disease, heart disease, and Parkinson’s disease [6–8].
12 Next-Generation Sequencing Data Analysis
1.4.10 Chloroplast
In animal cells, the mitochondrion is the only organelle that contains an
extranuclear genome. Plant and algae cells have another extranuclear
genome besides the mitochondrion, the plastid genome. Plastid is an organ-
elle that can differentiate into various forms, the most prominent of which
is the chloroplast. The chloroplast carries out photosynthesis by capturing
the energy in sunlight and fixing it into carbohydrates using carbon dioxide
as substrate, and releasing oxygen in the same process. For energy captur-
ing, the green pigment called chlorophyll first absorbs energy from sunlight,
which is then transferred through an electron transport chain to build up
a proton gradient to drive the synthesis of ATP. Despite the energy source,
the buildup of proton gradient for ATP synthesis in the chloroplast is very
similar to that for ATP synthesis in the mitochondrion. The chloroplast ATP
derived from the captured light energy is then spent on CO2 fixation. Similar
to the mitochondrion, the chloroplast also has two membranes, a highly per-
meable outer membrane and a much less permeable inner membrane. The
photosynthetic electron transport chain, however, is not located in the inner
membrane but in the membrane of a series of saclike structures called thy-
lakoids located in the chloroplast stroma (analogous to the mitochondrial
matrix).
Plastid is believed to have evolved from an endosymbiotic cyanobaterium,
which has gradually lost the majority of its genes in its genome over millions
of years. The current size of most plastid genomes is 100 to 200 kb, coding for
rRNAs, tRNAs, and proteins. In higher plants there are about 85 genes cod-
ing for various proteins of the photosynthetic system [10]. The transmission
of plastid DNA (ptDNA) from parent to offspring is more complicated than
the maternal transmission of mtDNA usually observed in animals. Based
on the transmission pattern, it can be classified into three types: (1) mater-
nal, inheritance only through the female parent; (2) paternal, inheritance
only through the male parent; or (3) bioparental, inheritance through both
parents [11]. Similar to the situation in mitochondrion, there exist multiple
copies of ptDNA in each plastid, and as a result there are large numbers
of ptDNA molecules in each cell with potential heteroplasmy. Transcription
from these ptDNA also generates copious amounts of RNAs in the organelle.
Therefore, sequence reads from ptDNA or RNA comprise part of the data
when sequencing plant and algae DNA or RNA samples, along with those
from mtDNA or RNA.
The Cellular System and the Code of Life 13
tasks in a cell that need to be carried out by genetic circuits, from the trans-
duction of extracellular signal to the inside, to the step-by-step breakdown
of energy molecules (such as glucose) to release energy, to the replication
of DNA prior to cell division. It is these genetic circuits that underlie cellu-
lar behavior and physiology. If the information or material flux in a genetic
circuit is blocked or goes awry, the whole system will be influenced, which
might lead to the malfunctioning of the system and likely a diseased state.
Based on the hierarchical organization principle of systems, gene circuits
interact with each other and form a complicated genetic network. Mapping
out a genetic network is a higher goal of systems biology. A genetic network
has been shown to share some common characteristics with nonbiological
networks such as the human society or the Internet [12]. One such charac-
teristic is modularity, for instance, when genes (or proteins) that often work
together to achieve a common goal form a module and the module is used
as a single functional unit when needed. Another common characteristic is
the existence of hub or anchor nodes in the network; for example, a small
number of highly connected genes (or proteins) in a genetic network serve
as hubs or anchors through which other genes (or proteins) are connected to
each other.
17
18 Next-Generation Sequencing Data Analysis
5′ A P C P G P A P T 3′
3′ T P G P C P T P A P G P C P C 5′
Extension T P P P
P P + H+
5′ A P C P G P A P T P C 3′
3′ T P G P C P T P A P G P C P C 5′
+ H+ Misincorporation
P P
5′ A P C P G P A P T P C P T 3′
3′ T P G P C P T P A P G P C P C 5′
Error correction
P T G P P P
5′ A P C P G P A P T P C 3′
3′ T P G P C P T P A P G P C P C 5′
Extension
P P + H+
5′ A P C P G P A P T P C P G 3′
3′ T P G P C P T P A P G P C P C 5′
FIGURE 2.1
The DNA replication process. To initiate the process, a primer, which is a short DNA sequence
complementary to the start region of the DNA template strand, is needed for DNA polymerase
to attach nucleotides and extend the new strand. The attachment of nucleotides is based on
complementary base-pairing with the template. If an error occurs due to mispairing, the DNA
polymerase removes the mispaired nucleotide using its proofreading function. Due to the bio-
chemical structure of the DNA molecule, the direction of the new strand elongation is from
its 5′ end to the 3′ end (the template strand is in the opposite direction; the naming of the two
ends of each DNA strand as 5′ and 3′ is from the numbering of carbon atoms in the nucleotide
sugar ring).
20 Next-Generation Sequencing Data Analysis
Promoter 3′-UTR
DNA (gene)
RNA
(primary transcript)
mRNA
Exon
Protein
Intron
FIGURE 2.2
The central dogma.
DNA Sequence 21
proteins through alternative splicing (see Chapter 3). In addition, the informa-
tion flow between DNA and RNA is not simply one-way from DNA to RNA,
but RNA can also be reverse transcribed to DNA in some organisms. On the
additional role of RNAs in this information flow, some non-protein-coding
RNAs can silence gene expression through mechanisms such as inhibiting
gene transcription or translation, or protect genomes through mechanisms
like preventing the movement of transposable elements (or transposons,
mobile DNA elements that copy themselves to different genomic loci) (also
see Chapter 3). Furthermore, chemical modifications of DNA and some DNA-
interacting proteins constitute the epigenome, which also regulates the flow
of genetic information.
2.4.2 Genome Sizes
For the least sophisticated organisms, such as Mycoplasma genitalium, a mini-
mal genome is sufficient. For increased organismal complexity, more genetic
information and, therefore, a larger genome is needed. As a result, there
is a positive correlation between organismal complexity and genome size,
especially in prokaryotes. In eukaryotes, however, this correlation becomes
much weaker, largely due to the existence of noncoding DNA elements in
22 Next-Generation Sequencing Data Analysis
TABLE 2.1
Genome Sizes and Total Gene Numbers in Major Model Organisms (Ordered
by Genome Size)
Organism Genome Size (bp)a Number of Coding Genes
Mycoplasma genitalium (bacterium) 580,076 476
Haemophilus influenzae (strain 86-028NP) 1,914,490 1792
(bacterium)
Escherichia coli (strain K-12) (bacterium) 4,646,332 4227
Saccharomyces cerevisiae (yeast) 12,157,105 6692
Caenorhabditis elegans (nematode) 103,022,290 20,447
Arabidopsis thaliana (thale cress) 135,670,229 27,416
Drosophila melanogaster (fruit fly) 168,736,537 13,937
Medicago truncatula (legume) 309,576,036 44,115
Oryza sativa (japonica subspecies) (rice) 374,424,240 35,679
Danio rerio (zebrafish) 1,505,581,940 26,459
Rattus norvegicus (rat) 2,573,362,844 22,777
Zea mays (maize) 3,233,616,351 39,469
Homo sapiens (human) 3,381,944,086 20,364
Mus musculus (mouse) 3,482,005,469 22,606
a Data based on Ensembl genome databases as of February 2015.
363 exons, the most in any single gene, and also has the longest single exon
(17,106 bp) among all currently known exons. The total number of currently
known exons in the human genome is around 180,000. With a combined
size of 30 Mb, they constitute 1% of the human genome. This collection of
all exons in the human genome, or in other eukaryotic genomes, is termed
as the exome. Different from the transcriptome, which is composed of all
actively transcribed mRNAs in a particular sample, the exome includes all
exons contained in a genome. Although it only covers a very small percent-
age of the genome, the exome represents the most important and the best
annotated part of the genome. Sequencing of the exome has been used as
a popular alternative to whole genome sequencing. While it lacks on cov-
erage, exome sequencing is more cost effective, faster, and easier for data
interpretation.
Exons
(1.5%)
Repetitive
sequences not
related to
transposons
(14%) Introns
(20%)
Regulatory
sequences
(5%)
Repetitive
sequences Other unique
related to noncoding DNA
transposons sequences
(44.5%) (15%)
FIGURE 2.3
The composition of the human genome.
2.5.2 Sequence Access
Since different DNA sequences in the genome are constantly being tran-
scribed, instead of being permanently locked into the compacted form, DNA
sequences at specific loci need to be dynamically exposed to allow tran-
scriptional access to protein factors such as transcription factors and coacti-
vators. Furthermore, DNA replication and repair also require chromatin
unpackaging. This unpackaging of the chromatin structure is carried out
through two principal mechanisms. One is through histone modification,
such as acetylation of lysine residues on histones by histone acetyltransfer-
ases, which reduces the positive charge on histones and therefore decreases
the electrostatic interactions between histones and DNA. Deacetylation by
histone deacetylases, on the other hand, restricts DNA access and represses
transcription. The other unpackaging mechanism is through the actions of
chromatin remodeling complexes. These large protein complexes consume
26 Next-Generation Sequencing Data Analysis
ATP and use the released energy to expose DNA sequences for transcrip-
tion through nucleosomal repositioning, nucleosomal eviction, or local
unwrapping.
2.5.3 DNA–Protein Interactions
While DNA is the carrier of the code of life, the DNA code cannot be exe-
cuted without DNA-interacting proteins. Nearly all of the processes men-
tioned earlier, including DNA packaging/unpackaging, transcription,
repair, and replication, rely on such proteins. Besides histones, examples of
these proteins include transcription factors, RNA polymerases, DNA poly-
merases, and nucleases (for DNA degradation). Many of these proteins, such
as histones and DNA/RNA polymerases, interact with DNA regardless of
their sequence or structure. Some DNA-interacting proteins bind to DNA
of special structure/conformation, for example, high-mobility group (HMG)
proteins that have high affinity for bent or distorted DNA. Some other DNA-
interacting proteins bind only to regions of the genome that have certain
characteristics such as having damage, the examples of which are DNA
repair enzymes such as BRCA1, BRCA2, RAD51, RAD52, and TDG.
The most widely studied DNA-interacting proteins are transcription
factors, which bind to specific DNA sequences. Through binding to their
specific recognition sequences in the genome, transcription factors regu-
late transcription of gene targets that contain such sequences in their pro-
moter region. Since they bind to more than one gene location in the genome,
transcription factors regulate the transcription of a multitude of genes in
a coordinated fashion, usually as a response to certain internal or external
environmental changes. For instance, NRF2 is a transcription factor that is
activated in response to oxidative stress. Upon activation, it binds to a short
segment of a specific DNA sequence called the antioxidant response ele-
ment (ARE), located in the promoter region of those genes that are respon-
sive to oxidative stress. Through binding to this sequence element in many
regions of the genome, NRF2 regulates the transcription of its target genes
and thereby elicits coordinated responses to counteract the damaging effects
of oxidative stress.
Study of DNA–protein interactions provides insights into how the genome
responds to various conditions. For example, determination of transcrip-
tion factor binding sites, such as those of NRF2, across the genome can
unravel what genes might be responsive to the conditions that activate the
transcription factors. Although such sites can be predicted computationally,
only a wet-lab experiment can determine where a transcription factor actu-
ally binds in the genome under a certain condition. ChIP-Seq, or chromatin
immunoprecipitation coupled with sequencing, is one application of next-
generation sequencing (NGS) that is developed to study genomic binding of
transcription factors and other DNA-interacting proteins. Chapter 11 focuses
on ChIP-Seq data analysis.
DNA Sequence 27
resolution when scanning for genomic region(s) that are associated with a
phenotype or disease of interest.
Besides single nucleotide substitutions, indels are another common type
of mutation. Most indels involve small numbers of nucleotides. In protein-
coding regions, small indels lead to the shift of ORF (unless the number of
nucleotides involved is a multiple of three), resulting in the formation of a
vastly different protein product. Indels that involve large regions (>1 kb in
size) lead to alterations of genomic structure and are usually considered as
a form of SV. Besides large indels, SVs also include inversions, transloca-
tions, or duplications that involve alterations of larger DNA regions (typi-
cally >1 kb). Copy number variation (CNV) is a subcategory of SV, usually
caused by large indel or segmental duplication. Although they affect larger
genomic region(s) and some lead to observable phenotypic changes or
diseases, many CNVs, or SVs in general, have no detectable effects. The
frequency of SVs in the genome was previously underestimated due to
technological limitations. The emergence of NGS has greatly enabled SV
detection, which has led to the realization of its wide existence [19].
2.7 Genome Evolution
The spontaneous mutations that lead to sequence variation and polymor-
phism in a population are also the fundamental force behind the evolution
of genomes and eventually the Darwinian evolution of the host organisms.
Gradual sequence change and diversification of early genomes, over billions
of years, have evolved into the extremely large number of genomes that
had existed or are functioning in varying complexity today. In this process,
existing DNA sequences are constantly modified, duplicated, and reshuffled.
Most mutations in protein-coding or regulatory sequences disrupt the pro-
tein’s normal function or alter its amount in cells, causing cellular dysfunc-
tion and affecting organismal survival. Under rare conditions, however, a
mutation can improve existing protein function or lead to the emergence of
new functions. If such a mutation offers its host a competitive advantage, it is
more likely to be selected and passed on to future generations.
Gene duplication provides another major mechanism for genome evolu-
tion. If a genomic region containing one or multiple gene(s) is duplicated
resulting in the formation of an SV, the duplicated region is not under selec-
tion pressure and therefore becomes substrate for sequence divergence and
new gene formation. Although there are other ways of adding new genetic
information to a genome such as interspecies gene transfer, DNA duplica-
tion is believed to be a major source of new genetic information generation.
Gene duplication often leads to the formation of gene families. Genes in the
same family are homologous, but each member has its specific function and
DNA Sequence 29
expression pattern. As an example, in the human genome there are 339 genes
in the olfactory receptor gene family. Odor perception starts with the bind-
ing of odorant molecules to olfactory receptors located on olfactory neurons
inside the nose epithelium. To detect different odorants, a combination of dif-
ferent olfactory receptors that are coded by genes in this family is required.
Based on their sequence homology, members of this large family can be even
further grouped into different subfamilies [20]. The existence of pseudo-
genes in the genome is another result of gene duplication. After duplication,
some genes may lose their function and become inactive from additional
mutation. Pseudogenes may also be formed in the absence of duplication
by the disabling of a functional gene from mutation. A pseudogene called
GULO mapped to the human chromosome 8p21 provides such an example.
The functional GULO gene in other organisms codes for an enzyme that
catalyzes the last step of ascorbic acid (vitamin C) biosynthesis. This gene is
knocked out in primates, including humans, and becomes a pseudogene. As
a result, we have to get this essential vitamin from food. The inactivation of
this gene is possibly due to the insertion to the gene’s coding sequence of a
retrotransposon-type repetitive sequence called Alu element [21].
DNA recombination, or reshuffling of DNA sequences, also plays an
important role in genome evolution. Although it does not create new genetic
information, by breaking existing DNA sequences and rejoining them, DNA
recombination changes the linkage relationships between different genes
and other important regulatory sequences. Without recombination, once
a harmful mutation is formed in a gene, the mutated gene will be perma-
nently linked to other nearby functional genes, and it becomes impossible to
regroup all the functional genes into the same DNA molecule. Through this
regrouping, DNA recombination makes it possible to avoid gradual accumu-
lation of harmful gene mutations. Most DNA recombination events happen
during meiosis in the formation of gametes (sperm or eggs) as part of sexual
reproduction.
when the Human Genome Project was completed in 2003. This has gradu-
ally led to a paradigm shift in disease diagnosis and prevention. As a result,
the public becomes more aware of the role of individual genomic makeup in
disease development and predisposition. In addition, the easier accessibility
to our DNA sequence has further prompted us to look into our genome and
use that information for preemptive disease prevention. The declining cost
of genome sequencing has also enabled the biomedical community to dig
deeper into the genomic underpinnings of diseases, by unraveling the link-
age between sequence polymorphism in the genome and disease incidence.
Following is a brief overview of the major categories of human diseases that
have an intimate connection with DNA mutation, polymorphism, genome
structure, and epigenomic abnormality.
of mentally stimulating activities, and high cholesterol levels are all risk fac-
tors for developing AD. Because of the number of genes involved and their
interactions with nongenetic factors, complex multigene diseases are more
challenging to study than single-gene diseases.
2.9.4 Epigenomic/Epigenetic Diseases
Besides gene mutations and genome instability, abnormal epigenomic/
epigenetic patterns can also lead to diseases. Examples of diseases in
this category include fragile X syndrome, ICF syndrome, Rett syndrome,
and Rubinstein-Taybi syndrome. In ICF syndrome, for example, the gene
DNMT3B is mutated leading to the deficiency of DNA methyltransferase 3B.
Patients afflicted with this disease invariably have DNA hypomethylation,
and have symptoms such as facial anomaly, immunodeficiency, and chro-
mosome instability. Cancer, as a genome disease that is caused by more than
one genetic/genomic factor, is also characterized by abnormal DNA meth-
ylation, including both hypermethylation and hypomethylation. The hyper-
methylation is commonly observed in the promoter CpG islands of tumor
DNA Sequence 33
35
36 Next-Generation Sequencing Data Analysis
3.3.1 DNA Template
To initiate transcription, a gene’s DNA sequence is first exposed through
altering its packing state. In order to transcribe the DNA sequence, the two
DNA strands in the region are first unwound, and only one strand is used as
the template strand for transcription. Since it is complementary to the RNA
transcript in base pairing (A, C, G, and T in the DNA template are tran-
scribed to U, G, C, and A, respectively, in the RNA transcript), this DNA
template strand is also called the antisense or negative (–) strand (Figure 3.1).
The other DNA strand has the same sequence as the mRNA (except with T’s
in DNA being replaced with U’s in RNA) and is called the coding, sense, or
positive (+) strand. It should be noted that either strand of the genomic DNA
can be potentially used as the template, and which strand is used as the
template for a gene depends on the orientation of the gene along the DNA. It
should also be noted that the triplet nucleotide genetic code that determines
how amino acids are assembled in proteins refers to the triplet sequence in
the mRNA sequence.
DNA
5´... A T G A G A A C G T T A G G C ...3´ Sense strand
3´... T A C T C T T G C A A T C C G ...5´ Antisense strand
Transcription
mRNA
5´... A U G A G A A C G U U A G G C ...3´
Translation
Peptide
Met - Arg - Thr - Leu - Gly - ...
FIGURE 3.1
How the two strands of DNA template match the transcribed mRNA in sequence, and the
genetic code in mRNA sequence corresponds to the peptide amino acid sequence.
38 Next-Generation Sequencing Data Analysis
consensus sequence TATAAT. Once reaching the TSS, the sigma factor dis-
associates from the core enzyme. The core RNA polymerase, unlike DNA
polymerase, does not need a primer, but otherwise the enzyme catalyzes the
attachment of nucleotides to the nascent RNA molecule one at a time in the
5′→3′ direction. At a speed of approximately 30 nucleotides/second, the RNA
polymerase slides through the DNA template carrying the elongating RNA
molecule.
Although the attachment of new nucleotides to the elongating RNA is
based on base pairing with the DNA template, the new elongating RNA does
not remain associated with the template DNA via hydrogen bonding. On the
same template, multiple copies of RNA transcripts can be simultaneously
synthesized by multiple RNA polymerases one after another. During tran-
script elongation, these polymerases hold on tightly to the template and do
not disassociate from the template until the stop signal is transcribed. The
stop signal is provided by a segment of palindromic sequence located at the
end of the transcribed sequence. Right after transcription, the inherent self-
complementarity in the palindromic sequence leads to the spontaneous for-
mation of a hairpin structure. An additional stop signal is also provided by
a string of four or more uracil residues after the hairpin structure, which
forms weak associations with the complementary A’s on the DNA template.
The hairpin structure pauses further elongation of the transcript, and the
weak associations between the U’s on the RNA and the A’s on the DNA dis-
sociate the enzyme and the transcript from the template.
Regulation of prokaryotic transcription is conferred by promoters and pro-
tein factors such as repressors and activators. Promoter strength, that is, the
number of transcription initiation events per unit time, varies widely in dif-
ferent operons. For example, in E. coli, genes in operons with weak promoters
can be transcribed once in 10 minutes, while those with strong promoters
can be transcribed 300 times in the same amount of time. The strength of an
operon’s promoter is based on the host cell’s demand for its protein products
and dictated by its sequence. Specific protein factors may also regulate gene
transcription. Repressors, the best known among these factors, prevent RNA
polymerase from initiating transcription through binding to an intervening
sequence between the promoter and TSS called an operator. Activators exert
an opposite effect and induce higher levels of transcription. The sigma fac-
tor, being the initiation factor of the prokaryotic RNA polymerase, provides
another mechanism for regulation. There are different forms of this factor
in prokaryotic cells, each of which mediates sequence-specific transcription.
Differential use of these sigma factors, therefore, provides another level of
transcriptional regulation in prokaryotic cells.
Exon skipping
Intron retention
FIGURE 3.2
Varying forms of RNA transcript splicing.
42 Next-Generation Sequencing Data Analysis
In the third step, once the new primary transcript passes the termination
signal sequence, it is bound by several termination-related proteins. One of
the proteins cleaves the RNA at a short distance downstream of the termina-
tion signal to generate the 3′ end. This is followed by a polyadenylation step
that adds 50 to 200 A’s to the 3′ end by an enzyme called poly-A polymerase.
This poly-A tail, like the 5′ end cap, increases the stability of the resulting
mRNA. This tail is bound and protected by a poly(A)-binding protein, which
also promotes its transport to the cytoplasm.
Besides these three major constitutive processing steps, some transcripts
may undergo additional processing steps. RNA editing, although considered
to be rare, is among the best known of these steps. RNA editing refers to the
change in RNA nucleotide sequence after it is transcribed. The most com-
mon types of RNA editing are conversions from A to I (inosine, read as G
during translation), which are catalyzed by enzymes such as ADARs (ade-
nosine deaminases that act on RNA), or from C to U, catalyzed by cytidine
deaminases. As a result of these conversions, an edited RNA transcript no
longer fully matches the sequence on the template DNA. RNA editing has
the potential to change genetic codons, introduce new or remove existing
stop codons, or alter splicing sites [35]. Evidence shows that RNA editing and
other RNA processing events such as splicing can be coordinated [36].
DNA
1 – Transcriptional control
Pre-mRNA
mRNA
Nucleus
mRNA
4 – Translational control
Protein
Inactive mRNA/
degraded mRNA
FIGURE 3.3
The regulation of eukaryotic gene expression at multiple levels.
3.4.1 Ribozyme
Similar to proteins, RNAs can form complicated three-dimensional struc-
tures, and some RNA molecules carry out catalytic functions. These catalytic
RNAs are called ribozymes. A classic example of a ribozyme is one type
of intron called group I intron, which splices itself out of the pre-mRNA
that contains it. This self-splicing process, involving two transesterification
steps, is not catalyzed by any protein. A group I intron is about 400 nucleo-
tides in length and mostly found in organelles, bacteria, and the nucleus of
lower eukaryotes. When a precursor RNA that contains a group I intron is
incubated in a test tube, the intron splices itself out of the precursor RNA
autonomously. Despite variations in their internal sequences, all group I
introns share a characteristic spatial structure, which provides active sites
for catalyzing the two steps. Another example of ribozyme is the 23S rRNA
contained in the large subunit of the prokaryotic ribosome. This rRNA cat-
alyzes the peptide bond formation between an incoming amino acid and
the existing peptide chain. Although the large subunit contains more than
30 proteins, rRNA is the catalytic component, while the proteins only pro-
vide structural support and stabilization [46].
Also similar to protein catalysts, the dynamics of the reactions catalyzed
by ribozymes follows the same characteristics as those of protein enzyme-
catalyzed reactions, which are usually described by the Michaelis-Menten
equation. Further similarities of ribozymes to protein enzymes include that
ribozyme activity can also be regulated by ligands, usually small molecules,
the binding of which leads to structural change in the ribozyme. For instance,
a ribozyme may contain a riboswitch, which as part of the ribozyme can
bind to a ligand to turn on or off the ribozyme activity.
46 Next-Generation Sequencing Data Analysis
3.4.4.1 miRNA
Mature miRNA, in the size range of 19 to 24 bases, induces gene silencing
through mRNA translational repression or decay. The precursor of miRNA
is usually transcribed from non-protein-coding genes in the genome
(Figure 3.4). The primary transcript, called pri-miRNA, contains an inter-
nal hairpin structure and is much longer than mature miRNA. For initial
processing, the pri-miRNA is first trimmed in the nucleus by a ribonuclease
called Drosha that exists as part of a protein complex called the microproces-
sor, to an intermediate molecule called pre-miRNA, about 70 nucleotides in
size. Alternatively, some miRNA precursors originate from introns spliced
out from protein-coding transcripts. These precursors, to be processed
for the generation of mirtrons (miRNAs derived from introns), bypass the
microprocessor complex in the nucleus. For further processing, the pre-
miRNA and the mirtron precursor are exported out of the nucleus into the
cytoplasm, where they are cleaved by the endoribonuclease Dicer to form
double-stranded miRNA. The double-stranded miRNA is subsequently
loaded into RISC. Argonaute, the core protein component of RISC, unwinds
the two miRNA strands and discard one of them [49]. The remaining strand
is used by Argonaute as the guide sequence to identify related mRNA targets
through imperfect base pairing with a seed sequence usually located in the
3′-UTR of mRNAs. Through this miRNA–mRNA interaction, RISC induces
silencing of target genes through repressing translation of the mRNAs and/ or
48 Next-Generation Sequencing Data Analysis
miRNA
gene
Pri-miRNA
Drosha
Pre-miRNA
Exportin 5
Nucleus
Cytoplasm dsRNA
Dicer
miRNA: siRNA
miRNA* duplex
duplex Unwind
RISC RISC
Ribosome
Target
mRNA
ORF RISC RISC RISC
Translational repression mRNA cleavage
FIGURE 3.4
The generation and functioning of miRNA and siRNA in suppressing target mRNA activity.
Genomic regions that code for miRNAs are first transcribed into pri-miRNAs, which are pro-
cessed into smaller pre-miRNAs in the nucleus by Drosha. The pre-miRNAs are then trans-
ported by exportin 5 into the cytoplasm, where they are further reduced to miRNA:miRNA*
duplex by Dicer. While both strands of the duplex can be functional, only one strand is assem-
bled into the RNA-induced silencing complex (RISC), which induces translational repression
or cleavage of target mRNAs. Long double-stranded RNA can also be processed by Dicer to
generate siRNA duplex, which also uses RISC to break down target mRNA molecules. (From
L He and GJ Hannon, MicroRNAs: Small RNAs with a big role in gene regulation, Nature
Reviews Genetics 2004, 5:522–531. With permission.)
RNA 49
3.4.4.2 siRNA
While being similar in size and using basically the same system for gene
silencing, siRNA differs from miRNA in a number of aspects. On origin,
siRNA is usually exogenously introduced, such as from viral invasion or
artificial injection. But they can also be generated endogenously, for example,
from repeat-sequence-generated transcripts (such as those from telomeres or
transposons), or RNAs synthesized from convergent transcription (in which
both strands of a DNA sequence are transcribed from the two opposite orien-
tations with corresponding promoters), or other naturally occurring sense–
antisense transcript pairs [50]. To generate mature siRNA, exogenously
introduced double-stranded RNA, or endogenously transcribed precursor
that is transported from the nucleus to the cytoplasm, is cleaved by Dicer.
The mature siRNA is then loaded into RISC for silencing target mRNAs by
Argonaute. On target mRNA identification, siRNA differs from miRNA in
that it has perfect or nearly perfect sequence complementarity with their tar-
get. On the mechanism of gene silencing, siRNA usually leads to endonu-
cleolytic cleavage, also called slicing, of the mRNAs.
3.4.4.3 piRNA
piRNA is a relatively newer class of small noncoding RNAs between 24 and
31 nucleotides in length and have functions in animal germline tissues. While
using a similar basic RNAi mechanism, piRNA is different from miRNA and
siRNA in two major aspects. One is that its biogenesis does not involve Dicer,
and the other is that, for gene silencing, it specifically interacts with Piwi, a
different clade of Argonaute proteins. The biogenesis of piRNA starts from
transcription of long RNAs from specific loci of the genome called piRNA
clusters. With regard to these clusters, it has been found that while their
locations in the genome do not show much change in related species, their
sequences are not conserved even in closely related species. After transcrip-
tion, the RNAs are transported out of the nucleus, and then subjected to
a parsing process by endonuclease(s) that is currently not clearly known.
To induce gene silencing, mature piRNA is loaded into RISC that contains
Piwi, which uses the piRNA sequence as a guide to silence target mRNAs
by slicing. In addition, piRNA-loaded mature RISC can also be transported
into the nucleus, where it finds and silences target mRNAs that are still
in the process of being transcribed. This transcriptional gene silencing is
achieved through interactions with other protein factors in the nucleus, and
histone modification that alters chromatin structure and gene access. The
currently best-known function of piRNAs, through post-transcriptional and
50 Next-Generation Sequencing Data Analysis
Introduction to Next-
Generation Sequencing
(NGS) and NGS
Data Analysis
4
Next-Generation Sequencing (NGS)
Technologies: Ins and Outs
55
56 Next-Generation Sequencing Data Analysis
T G C A G G C A T C A G
Denatured G T C Labeled
template primer
A C G T C C G T A G T C
(a)
Denaturing gel
Labeled strands
G A T C
ddA C G T C C G T A G T C
A
ddC
C
ddG
G
ddT
T
ddC
C
ddC
C
ddG
G
ddT
T
ddA
A
(b)
FIGURE 4.1
The Sanger sequencing method as originally proposed. This method involves a step for new
DNA strand synthesis using the sequencing target DNA as the template (panel a), followed by
sequence deduction through resolution of the newly synthesized DNA strands (panel b). In the
first step, the new strand synthesis reaction contains denatured DNA template, radioactively
labeled primer, DNA polymerase, and dNTPs. For the dNTPs, the Sanger method is character-
ized by the use of dideoxynucleotides (ddG, ddA, ddT, and ddC, as shown in the figure) along
with regular unmodified nucleotides. The DNA polymerase in the reaction incorporates dide-
oxynucleotides into the elongating DNA strand like regular nucleotides, but once a dideoxy-
nucleotide is incorporated, the strand elongation terminates. In this sequencing scheme, each
of the four dideoxynucleotides is run in a separate reaction, and the ratio of these dideoxynu-
cleotides to their regular counterparts in each reaction is controlled so that the polymerization
can randomly terminate at each base position. The end product of each reaction is a popula-
tion of DNA fragments with different lengths, with the length of each fragment dependent on
where the dideoxynucleotide is incorporated. Panel b illustrates the separation of these frag-
ments in a denaturing gel by electrophoresis, in which smaller fragments migrate faster than
larger ones and appear toward the bottom of the gel. The radioactive labeling on the primer
enables visualization of the fragments as bands on the gel. Shown on the right are DNA frag-
ments that correspond to each of the bands, respectively. From the arrangement of DNA bands
the complementary sequence of the original DNA template can be deduced (shown on the left
of the sequencing gel, read from bottom upward). (From P Moran, Overview of commonly
used DNA techniques, in LK Park, P Moran, and RS Waples, eds., Application of DNA Technology
to the Management of Pacific Salmon, 1994, 15–26, Department of Commerce, NOAA Technical
Memorandum NMFS-NWFSC-17. © Paul Moran, NOAA’s Northwest Fisheries Science Center.
With permission.)
Next-Generation Sequencing (NGS) Technologies 57
lowering sequencing cost, largely due to the segregation of its DNA synthe-
sis process and the subsequent DNA chain separation/detection process.
Its principle of sequencing-by-synthesis, however, becomes the basis of sev-
eral NGS technologies, all of which are characterized by extremely high
throughput. These technologies generally use nucleotides with reversible
terminator or other cleavable chemical modifications, or regular unmodi-
fied nucleotides, so the new DNA strand synthesis is not permanently
terminated and therefore can be monitored as or after each base is incorpo-
rated. This development, along with advancements in other relevant fields,
makes it possible to conduct sequencing of millions of DNA fragments
simultaneously.
One of the early NGS technologies, 454 (later acquired by Roche), achieved
high-throughput sequencing by further developing a method called pyro-
sequencing. This method is based on the detection of the pyrophosphate
released after each nucleotide incorporation in the new DNA strand synthe-
sis [66]. In this technology, each of the four different nucleotides is added
into the sequencing reaction in a fixed order one at a time. If complemen-
tary, the corresponding nucleotide (or more than one, if there is a homopol-
ymer on the template) is ligated to the new strand, and as part of the ligation
reaction a pyrophosphate is released as a side product. An enzyme called
ATP sulfurylase converts this pyrophosphate to ATP, which in turn is used
to convert luciferin to oxyluciferin by luciferase. The generated oxyluciferin
emits light, and the amount of light emitted is generally proportional to
the number of nucleotides incorporated. By detecting light emission after
each cycle of nucleotide addition, the sequence on each DNA template is
deduced. High throughput is achieved when massive numbers of DNA
templates are sequenced in this fashion simultaneously. Using the 454/
Roche technology, sequence reads of 400 to 500 bp in length are generated.
A number of widely used NGS technologies, including Illumina reversible
dye-terminator sequencing, Ion Torrent semiconductor sequencing, and
Pacific Biosciences single-molecule real-time sequencing, are all based on
the sequencing-by-synthesis principle. The specifics of these methods will
be detailed in Section 4.3.
Not all NGS technologies are based on the principle of sequencing-by-
synthesis. For example, the SOLiD (Sequencing by Oligonucleotide Ligation
and Detection) system from Life Technologies uses a sequencing-by-ligation
process. Nanopore sequencing, as commercialized by companies such as
Oxford Nanopore Technologies, deduces the DNA sequence through detect-
ing differential electric field disturbances caused by different nucleotides
when a strand of DNA is threaded through a nanopore structure. Despite the
differences in how different NGS technologies work in principle, the overall
workflow of an NGS experiment is more or less similar. Next is an overview
of a typical NGS experimental workflow as conducted in a wet lab, along
with early-phase data analysis.
58 Next-Generation Sequencing Data Analysis
Sequencing target
(DNA or RNA)
Fragmentation
Size selection
Adapter ligation
Immobilization to
a solid support
Sequencing using
NGS technologies
G A T C A A T
A T A T C A A
A T C T A G A Sequencing reads
A G G T A T C
A T A T C A A
C A A T G T C
A G G T A T C
A T C T A G A
A T A T C A A
A T A T C A A Alignment (or assembly)
G A T C A A T of sequencing reads
C A A T G T C
A G G T A T C T A G A T C A A T G T C
FIGURE 4.2
The general workflow of an NGS experiment.
60 Next-Generation Sequencing Data Analysis
4.3.1.2 Implementation
The sequencing reaction in an Illumina NGS system takes place in a flow cell
(Figure 4.3). The microfluidic channels in the flow cell, often called lanes, are
where sequencing reactions take place and sequencing signals are collected
through scanning. The top and bottom surface of each lane is covered with
a lawn of oligonucleotide sequences that are complementary to the anchor
sequences in the ligated adapters. When sequencing libraries are loaded into
each of the lanes, DNA templates in the libraries bind to these oligonucle-
otide sequences and become immobilized onto the lane surface (Figure 4.4).
After immobilization, each template molecule is clonally amplified through
a process called “bridge amplification,” through which up to 1000 identical
copies of the template are generated in close proximity (<1 micron in diam-
eter) forming a cluster. During sequencing, these clusters are basic detection
units, which generate enough signal intensity for base calling.
Under ideal conditions the simultaneous incorporation of nucleotides
to the many identical copies of sequencing templates in a single cluster is
expected to be in synchronization from step to step and therefore remain in
62 Next-Generation Sequencing Data Analysis
FIGURE 4.3
An Illumina sequencing flow cell. It is a special glass slide that contains fluidic channels inside
(called lanes). Sequencing libraries are loaded into the lanes for massively parallel sequencing
after template immobilization and cluster generation. In each step of the sequencing process,
a DNA synthesis mixture, including DNA polymerases and modified dNTPs, is pumped into
and out of each of the lanes through their inlet and outlet ports located at the two ends.
Next-Generation Sequencing (NGS) Technologies 63
(a)
(b)
(c)
B B
B
B
B
(d)
FIGURE 4.4
Illumina sequencing sample preparation and sequencing approaches. (a) Ligation of adapters
in a forked configuration to fragmented and size-selected DNA. Each ligation product is then
amplified using primers complementary to sequences in the adapters to generate blunt-end
amplicons. (b) Clonal amplification by isothermal bridge amplification. To achieve this, DNA
strands are first separated by denaturing, and each strand is attached to the flow cell surface
with complementary sequences. After a new strand (dotted line) is synthesized on the flow cell
surface, the original DNA strand is removed. The “free” end of the new strand then attaches to
the other anchoring oligonucleotide sequence on the flow cell surface with sequence comple-
mentarity, which leads to the formation of a bridge configuration for synthesis of a new com-
plementary strand. Repetition of this process generates many copies of the original sequence
template in a cluster. (c) Sequencing from one end or both ends of the DNA templates. Prior to
sequencing, one strand is cleaved within one adapter sequence (marked with an asterisk) and
then removed after denaturing. The remaining strand is used as the template for sequencing
from one end. For sequencing from the other end (i.e., paired sequencing), the sequencing tem-
plate is rebuilt. The template rebuilding process includes removal of the first read’s new strand
after denaturing, generation of a complementary template (dotted line) with the bridge synthe-
sis, and cleavage and removal of the original template. The new template is used to generate
read 2 from the opposite end. (d) Mate-pair sequencing. This strategy enables sequencing of
the two ends of long DNA fragments (e.g., >1 kb). In this strategy, the long DNA fragments are
first circularized and then fragmented. Those fragments that contain the end junction are then
sequenced using the paired-end process illustrated in b. (From DR Bentley, S Balasubramanian,
HP Swerdlow, GP Smith, J Milton, CG Brown, KP Hall et al., Accurate whole human genome
sequencing using reversible terminator chemistry, Nature 2008, 456:53–59. With permission.)
64 Next-Generation Sequencing Data Analysis
phase. In reality, a small percentage of templates lose sync with the major-
ity of molecules in the same cluster, leading to either falling behind (called
phasing), due to incomplete removal of the terminator as well as missing a
cycle, or being one or several bases ahead (prephasing), due to incorporation
of nucleotides with no terminators. The existence of phasing and prephas-
ing in a cluster leads to increased background noise and decreased base call
quality. When more and more sequencing cycles are conducted, this problem
becomes worse. This is why platforms that are based on clonal amplification
(which also include the Ion Torrent platform to be detailed next) have declin-
ing base call quality scores toward the end. Eventually the decrease in base
call quality reaches a threshold beyond which the quality scores become
simply unacceptable. The gradual loss of synchronicity is a major determi-
nant of read length for these platforms.
There are three steps in the Illumina sequence data generation process.
First, raw images captured after each cycle are analyzed to locate clusters
and report signal intensity, coordinates, and noise level for each cluster. This
step is conducted by the instrument control software. The output from this
step is fed into the next step of base calling by the instrument Real Time
Analysis (RTA) software, which uses cluster signal intensity and noise level
to make base calls and quality score calculation. This step also filters out low-
quality reads. In the third step, the base call files, or bcl files, are converted to
compressed FASTQ files by the Illumina’s proprietary software CASAVA. If
samples are indexed and sequenced in a multiplex fashion, demultiplexing
of the sequence data is also performed in the third step. The compressed and
demultiplexed FASTQ files are what an end user receives from an NGS core
facility after the completion of a run.
4.3.2.2 Implementation
The library construction process in this technology is similar to other NGS
technologies, involving ligation of platform-specific primers to DNA shot-
gun fragments. The library fragments are then clonally amplified by emul-
sion PCR onto the surface of 3-micron diameter beads. The microbeads
coated with the amplified sequence templates are then deposited into an Ion
chip. Each Ion chip has a liquid flow chamber that allows influx and efflux
of native nucleotides (introduced one at a time), along with DNA polymerase
and buffer that are needed in the sequencing-by-synthesis process. For
66 Next-Generation Sequencing Data Analysis
4.3.3.2 Implementation
The single-molecule sensitivity of this technology is achieved by the use
of zero-mode waveguide (or ZMW), a hole tens of nanometers in diam-
eter microfabricated in a metal film of 100 nm thickness, which is in turn
deposited onto a glass substrate. To conduct sequencing in a ZMW, a single
DNA polymerase and a single DNA strand (sequencing target) are immo-
bilized to its bottom. Because the diameter of a ZMW is smaller than the
wavelength of visible light, and due to the natural behavior of visible light
passing through such a small opening from the glass bottom, only the bot-
tom 30 nm of the ZMW is illuminated. Having a detection volume of only
20 zeptoliters (10 –21 L), this detection scheme greatly reduces background
noise and enables detection of nucleotide incorporation into a single DNA
molecule.
While the SMRT platform performs single molecule sequencing, it still
requires a lot of DNA samples at the starting point (1 μg currently). The
library prep process for SMRT is similar to other NGS platforms, includ-
ing shotgun fragmentation of DNA into required size, which is multi-
kilobases for this technology. This is followed by DNA fragment end
repair and adapter ligation. The resultant sequencing templates are then
annealed to sequencing primers, onto which DNA polymerases are sub-
sequently bound. Prior to sequencing, the template–primer–polymerase
complex is immobilized to the 150,000 ZMWs at the bottom of a SMRT
cell. Due to technical restraints, currently only about one third (~50,000 to
60,000) of these ZMWs generate quality reads that pass filter.
TABLE 4.1
Comparison of Current NGS Platforms
Read Data Output Cost Common Error Error Paired-End Required DNA
Platform Principle Detection Length per Run per Gbc Type Ratea Sequencing Sample Amount
HiSeq Reversible Fluorescence 125–250 1000 Gb $ Single nucleotide 10–3 Yes 50–1000 ng
2500 Terminator basesb substitution
MiSeq Reversible Fluorescence 300 15 Gb $$ Single nucleotide 10–3 Yes 50–1000 ng
Terminator bases substitution
Ion Proton pH change 400 Up to 2 Gb $$$ Indels (mostly at 10–2 Yes 100–1000 ng
Torrent release and bases homopolymers)
PGM pH change
Ion Proton pH change 200 10 Gb $$$$ Indels (mostly at 10–2 Yes 100–1000 ng
Proton release and bases (PI Chip) homopolymers)
pH change
PacBio ZMW and Fluorescence 8.5 kb 375 Mb $$$$$ Indels 10–1 No 1000 ng
RSII single average
molecule
sequencing
a Source: CW Fuller, LR Middendorf, SA Benner, GM Church, T Harris, X Huang, SB Jovanovich et al., The challenges of sequencing by synthesis, Nature
Biotechnology 2009, 27:1013–1023. With some modifications.
b 125 bases: high-output mode; 250 bases: rapid mode.
c Relative sequencing cost: $, least expensive; $$$$$, most expensive.
Next-Generation Sequencing Data Analysis
Next-Generation Sequencing (NGS) Technologies 69
4.5.6 Metagenomics
To study a community of microorganisms like the microbiome in the gut
or those in a bucket of seawater, where extremely large but unknown num-
bers of species are present, a brutal force approach that involves the study of
all genomes contained in such a community is metagenomics. Recently, the
field of metagenomics has been greatly fueled by the development of NGS
technologies. By quickly sequencing everything in a metagenome, research-
ers can get a comprehensive profile of the makeup and functional state of
a microbial community. Compared to NGS data generated from a single
genome, the metagenomics data is much more complicated. Chapter 13
focuses on metagenomics NGS data analysis.
5
Early-Stage Next-Generation Sequencing
(NGS) Data Analysis: Common Steps
73
74 Next-Generation Sequencing Data Analysis
Base calling
Data QC and
preprocessing
FIGURE 5.1
General overview of NGS data analysis. The steps in the dashed box are common steps con-
ducted in primary and secondary analysis.
where PErr is the probability of making a base call error. Based on this equa-
tion, a 1% chance of incorrectly calling a base is equivalent to a Q-score of
20, and Q30 means a 1/1000 chance of making a wrong call. Usually for a
base call to be reliable, it has to have a Q-score of at least 20. High-quality
calls have Q-scores above 30, usually up to 40. For better visualization of
@HISEQ:131:C5NWFACXX:1:1101:3848:2428 1:N:0:CGAGGCTGCTCTCTAT
CTTTTATCAGACATATTTCTTAGGTTTGAGGGGGAATGCTGGAGATTGTAATGGGTATGGAGACATATCATATAAGTAATGCTAGG
GTGAGTGGTAGGAAG
+
BB7FFFFB<F<FBFBBFBFBFFFIFFFFIIIFF<FBFFFBFIFFBFFFIFFFBFB07<BFFF7BBFFFBFFFFFF<BFBFBBBBBB
B'77B<770<BBBBB
FIGURE 5.2
The FASTQ sequence read report format. Shown here is one read generated from an NGS experiment. A FASTQ file usually contains millions of such
reads, with each containing several lines as shown here. Line 1, starting with the symbol “@,” contains sequence ID and descriptor. Line 2 is the read
sequence. Line 3 (optional) starts with the “+” symbol, which may be followed by the sequence ID and description. Line 4 lists confidence (or quality)
scores for each corresponding base in the read sequence (Line 2). For Illumina-generated FASTQ files, the sequence ID in line 1 in its current version
basically identifies where the sequence was generated. This information include the equipment (“HISEQ” in the example), sequence run ID (“131”),
flow cell ID (“C5NWFACXX”), flow cell lane (“1”), tile number within the lane (“1101”), x-y coordinates of the sequence cluster within the tile (“3848”
and “2848”, respectively). The ensuing descriptor contains information about the read number (“1” is for the single read here; for the paired-end
read it can be 1 or 2), whether the read is filtered (“N” here means it is not filtered), control number (“0”), and index (or sample barcode) sequence
(“CGAGGCTGCTCTCTAT”).
Early-Stage Next-Generation Sequencing (NGS) Data Analysis
75
76 Next-Generation Sequencing Data Analysis
; < = > ? @ A B C D E F G H I J
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
FIGURE 5.3
Encoding of base call quality scores with ASCII characters. ASCII stands for American
Standard Code for Information Interchange, and an ASCII code is the numerical representa-
tion of a character in computers (e.g., the ASCII code of the letter B is 66). In this encoding
scheme, the ASCII character codes are equal to Q-scores plus 33. Current major NGS platforms,
including Illumina (after version 1.8), use this encoding scheme for Q-score representation.
Q-scores associated with their corresponding base calls, they are usually
encoded with ASCII characters. Although there have been different encod-
ing scheme versions (e.g., Illumina 1.0, 1.3, and 1.5), the NGS field has mostly
settled on the use of the same encoding scheme used by Sanger sequencing
(Figure 5.3). In the FASTQ example shown in Figure 5.2, the first base, C, has
an encoded Q-score of B (i.e., 33).
To come up with the PErr, a control lane or spike control is usually used
to generate a base call score calibration table in Illumina sequencing for
lookup. A precomputed calibration table can also be used in the absence
of a control lane and spike control. Because each platform calibrates its
Q-scores differently, if they are to be compared with each other or ana-
lyzed in an integrated fashion, their Q-scores need to be recalibrated. To
carry out the recalibration, a subset of reads is used that map to regions
of the reference genome that contain no SNPs, and any mismatch between
the reads and the reference sequence is considered a sequencing error.
Based on the rate of mismatch at each base position of the reads, a new
calibration table is constructed, which is then used for recalibration. Even
without cross-platform NGS data comparison and integration, NGS data
generated on the same platform can still be recalibrated postmapping
(see Section 5.3) using the same approach, which often leads to improved
base call quality scores.
5.3 Reads Mapping
After the data is cleaned up, the next step is to map, or align, the reads to a
reference genome if it is available, or conduct de novo assembly. As shown
in Figure 5.1, most NGS applications require reads mapping to a reference
genome prior to conducting further analysis. The purpose of this map-
ping process is to locate origins of the reads in the genome. Compared to
searching for the location(s) of a single or a small number of sequences in a
genome by tools such as BLAST, simultaneous mapping of millions of NGS
reads, sometimes very short, to a genome is not trivial. A further challenge
comes from the fact that any particular genome from which NGS reads are
derived deviates from the reference genome at many sites because of poly-
morphism and mutation. As a result any algorithm built for this task needs
to accommodate such sequence deviations. To further complicate the situ-
ation, sequencing errors are often indistinguishable from true sequence
deviations.
Position N
Burrows-Wheeler
Position 2 transform and indexing
CTGC CGTAAACTAATG
Bowtie index
Position 1
(~2 gigabytes)
ACTG CCGTAAACTAAT ACTC CCGT ACTC TAAT ACTCCCGTACTCTAAT
ACTG * * * * AAAC * * * * 1 T
* * * * CCGT * * * * TAAT Six seed 2 Look up AT
ACTG * * * * * * * * TAAT pairs per 3 “suffixes” AAT
* * * * * * * * AAAC TAAT read/ 4 of read
ACTG CCGT * * * * * * * * fragment 5
* * * * CCGTAAAC * * * * 6
ACTCCCGTACTCTAAT
Index seed pairs Hits identify
positions in
Seed index genome where
(tens of gigabytes) Look up each pair read is found
of seeds in index
ACTG * * * * AAAC * * * *
Hits identify positions
in genome where
spaced seed pair
is found
* * * * CCGT * * * * TAAT
ACTG * * * * * * * * TAAT Confirm hits Convert each
* * * * CCGTAAAC * * * * by checking hit back to
“****” positions genome location
FIGURE 5.4
Two NGS reads mapping approaches. (a) The approach based on spaced seed indexing. In this
illustration, spaced seeds extracted from the reference genome sequence are indexed. Some
mapping algorithms like MAQ and ELAN index reads (usually in batches) instead. (b) A newer
approach developed on the basis of the Burrows-Wheeler transform. In this example, the algo-
rithm Bowtie performs mapping by looking up reads base by base, from right to left, against
the transformed and indexed genome. (Modified from C Trapnell, SL Salzberg, How to map
billions of short reads onto genomes, Nature Biotechnology 2009, 27:455–457. With permission.)
such as Novoalign, Stampy [85], and SHRiMP2 [86] are often used. Most of
these aligners were initially developed to map very short reads, such as those
of 35 nucleotides from early Illumina sequencers. With the gradual increase in
read length, these aligners have been adapted accordingly; for instance, BWA-
MEM is a recent adaptation of the original BWA algorithm for aligning longer
reads [87].
For aligning much longer reads such as those from the Pacific Biosciences
SMRT platform, aligners designed to handle long sequences, such as BLASR
[88], LAST [89], LASTZ [90], or BWA-MEM, should be used. Among these
long reads aligners, LAST, which uses adaptive seeds instead of fixed-
length seeds and suffix array indexing to achieve speed and sensitivity, and
LASTZ, which uses the more traditional seed-and-extend approach like
BLAST, are whole genome alignment tools originally designed for genome-
scale comparisons and therefore can conduct pairwise alignment on very
long sequences. BLASR is designed for aligning long reads generated from a
single DNA molecule like those from the SMRT system. It conducts mapping
of such reads through combining short read mapping data structures and
alignment methods used by whole genome alignment tools.
Besides mapping algorithms, selection of reference genome sequences,
when multiple reference genome sequences are available, also affects map-
ping result. By the design of most current mappers, reads that are more
similar to the selected reference sequence align better than those that devi-
ate more from the reference. If the deviation is sufficiently large, it might be
discarded as a mismatch. As a result, the use of different reference genome
sequences can introduce a “reference bias.” The use of any one particular ref-
erence genome invariably introduces this bias, as a single reference genome
simply cannot accommodate sequence variations and polymorphisms that
are naturally present in a population or species. This bias should be kept in
mind though, especially when the genetic background of the source organ-
ism is different from the reference genome. In this situation, comparison of
mapping results from the use of different references can help select a ref-
erence that is more appropriate. Alternatively, some more recent mapping
algorithms, such as GenomeMapper [91], have the capability of using mul-
tiple genome references simultaneously as a reference.
TABLE 5.1
Mandatory Fields in the SAM/BAM Alignment Section
Col Field Type Description
1 QNAME String Query sequence read (or template) NAME
2 FLAG Integer Bitwise FLAG
3 RNAME String Reference sequence NAME
4 POS Integer Leftmost mapping POSition on the reference sequence
5 MAPQ Integer MAPping Quality
6 CIGAR String CIGAR string
7 RNEXT String Reference name of the NEXT read (For paired-end reads)
8 PNEXT Integer Position of the NEXT read (For paired-end reads)
9 TLEN Integer Observed Template LENgth
10 SEQ String Segment SEQuence
11 QUAL String ASCII of Phred-scaled base QUALity+33
if it exists, provides generic information about the SAM/BAM file and is
placed above the alignment section. Each line in the header section starts
with the symbol “@.” For the alignment section there are 11 mandatory fields
(listed in Table 5.1). An example of the SAM/BAM format is presented in
Figure 5.5.
In the example shown in Figure 5.5, the header section contains two
lines. The first line has the two-letter record type code HD, signifying it
as the header line, which is always the first line if present. This record
has two tags: VN, for format version, and SO, for sorting order (in this
case the alignments are sorted by coordinate). The second line is for SQ,
which is the reference sequence dictionary. It also has two tags SN and
LN, for reference sequence name and reference sequence length, respec-
tively. For the alignment section, while most of the fields listed on Table
5.1 are self-explanatory, some fields may not be so clear at first glance. The
FLAG field uses a simple decimal number to track the status of 11 flags
used in the mapping process, such as whether there are multiple segments
in the sequencing (like r001 in the example) or if the SEQ is reverse comple-
mented. To check on the status and meaning of these flags, the decimal
number needs to be converted to its binary counterpart. For the POS field,
SAM uses a 1-based coordinate system, that is, the first base of the refer-
ence sequence is counted as 1 (instead of 0). The MAPQ is the mapping
quality score, which is calculated similarly to the Q-score introduced ear-
lier (MAPQ = –10 × log10(PMapErr)). The CIGAR field describes in detail how
the SEQ maps to the reference sequence, with the marking of additional
bases in the SEQ that are not present in the reference, or missing reference
bases in the SEQ. In the earlier example, the CIGAR field for r001/1 shows a
value of “8M2I4M1D3M,” which means the first 8 bases matching the refer-
ence, the next 2 bases being insertions, the next 4 matching the reference,
Early-Stage Next-Generation Sequencing (NGS) Data Analysis 83
+r001/1 CGAAGGTACTATGA*ATG
+r002 cggAAGGTA*TATGA
+r003 TGACAT..............TACCG
-r001/2 ACCGCGACA
(a)
FIGURE 5.5
The SAM/BAM format for storing NGS reads alignment results. The alignment shown in
panel (a) is captured by the SAM format shown in panel (b). In panel (a), the reference sequence
is shown on the top with the corresponding coordinates. Among the sequences derived from
it, r001/1 and r001/2 are paired reads. The bases in lowercase in r002 do not match the refer-
ence and as a result are clipped in the alignment process. The read r003 represents a spliced
alignment. In panel b, the SAM format contains 11 mandatory fields that are explained in more
detail in Table 5.1.
the next 1 being a deletion, and the last 3 again being matches. For more
details (such as those on the different FLAG status) and full specification of
the SAM/BAM format, please refer to the documentation from the SAM/
BAM Format Specification Working Group.
Second, reads that map to multiple genomic locations, often called multi
reads, usually do not contribute to subsequent analysis and therefore are
filtered out. The ambiguity in the mapping of multireads is due to the
aforementioned sequence deviation caused by polymorphism and muta-
tion, sequencing error, and the existence of highly similar sequences in
the genome such as those from duplicated genes. Inclusion of these reads
in downstream analysis may lead to biased or erroneous results. For most
experiments, these reads should be excluded from further analysis. As fil-
tering of multireads usually removes a significant number of reads, which
may lead to potential loss of information, there are some algorithms (such
as BM-Map [92]) that are designed to reuse multireads by probabilistically
allocating them to competing genomic loci.
Third, besides multireads, duplicate reads should also be identified and
filtered out for many experiments. In a diverse nonenriched sequenc-
ing library, because of the randomness of the fragmentation process, the
chance of getting identical fragments is extremely low. Even with a PCR
step to enrich DNA fragments, the chance of generating duplicate reads is
still very low (usually <5%), as the number of cycles in the PCR process is
limited and the subsequent sequencing process is a random sampling of
the DNA library (to varying depth). The existence of excessive numbers of
duplicate reads, therefore, suggests PCR overamplification. Duplicate reads
can be detected based on sequence identity, but due to sequencing errors
this tend to underestimate the amount of duplicate reads. It is more appro-
priate, therefore, to detect duplicate reads after the mapping step (Figure
5.6). Because technical duplicates caused by PCR overamplification and
true biological duplicates are indistinguishable, researchers should exert
caution when making decisions on whether to remove duplicate reads from
further analysis. Although removing duplicate reads can lead to increased
performance in subsequent analysis in many cases (such as variant dis-
covery), in circumstances that involve less complex or mostly enriched
sequencing targets, including those from an extremely small genome or
those used in RNA-Seq or ChIP-Seq, removing them can lead to loss of true
biological information.
Furthermore, a variety of other steps can also be conducted to operate SAM/
BAM files. These steps are usually provided by SAMtools and Picard, two
widely used packages for operating SAM/BAM files. These operations include
FIGURE 5.6
Detection of duplicate reads after the mapping process. Depth of coverage of the reference
genomic region is shown on the top. Mapped reads, along with a set of duplicate reads that map
to the same area, are shown underneath. The light and dark gray colors denote the two DNA
strands. (Generated with CLC Genomics Workbench and used with permission from CLC bio.)
SAMtools and Picard are very versatile in handling and analyzing SAM/
BAM files. In fact, the steps mentioned earlier, that is, generation of align-
ment summary statistics and removal of multireads and duplicate reads,
can be directly conducted with these tools. For example, both SAMtools and
Picard have utilities to detect and remove duplicate reads called rmdup and
markduplicates, respectively. These utilities mark reads that are mapped to
the same starting genomic locations as duplicates.
Last, in terms of examining mapping results, nothing can replace direct
visualization of the mapped reads in the context of the reference genome.
While a text-based alignment viewer, such as that provided by SAMtools,
offers a simple way to examine a small genomic region, direct graphical
86 Next-Generation Sequencing Data Analysis
FIGURE 5.7
The pileup file format as generated from SAMtools. A pileup file shows how sequenced bases
in mapped reads align with the reference sequence at each genomic coordinate. The columns
are (from left to right) chromosome (or reference name), genomic coordinate (1-based), refer-
ence base, total number of reads mapped to the base position, read bases, and their call quali-
ties. In the read bases column, a dot signifies a match to the reference base, a comma to the
complementary strand, and “AGCT” are mismatches. Additionally, the “$” symbol marks the
end of a read, while “^” marks the start of a read, and the character after the “^” represents
mapping quality.
5.4 Tertiary Analysis
After the sequence reads mapping step, subsequent analyses vary greatly
with application. For example, the workflow for RNA-Seq data analysis is
different from that for mutation and variant discovery. Therefore, it is not
possible to provide a “typical” workflow for all NGS data analyses in this
chapter beyond the common steps of data QC, preprocessing, and reads
mapping. Chapters in Section III provide details on application-specific ter-
tiary analytic steps and commonly used tools.
6
Computing Needs for Next-Generation
Sequencing (NGS) Data Management
and Analysis
The gap between our ability to pump out next-generation sequencing (NGS)
data and our capability to extract knowledge from these data is getting
broader. To manage and process the torrent of NGS data for deep understand-
ing of biological systems, significant investment in computational infrastruc-
ture and analytical power is needed. How to gauge computing needs and
build a system to meet the needs, however, poses serious challenges to small
research groups and even large research organizations. To meet this unprec-
edented challenge, the NGS field can borrow solutions from other “big data”
fields such as high-energy particle physics, climatology, and social media. For
biologists without much training in bioinformatics, while getting expert help
is needed, having a good understanding of the various aspects of NGS data
management and analysis will be beneficial for years to come.
87
88 Next-Generation Sequencing Data Analysis
such as scanned images or movies are in the scale of terabytes from a single
run (this amount is not counted in the data volume mentioned earlier). As
these raw signal files accumulate, they can easily overwhelm most data stor-
age systems. While these raw images files can be retained long term, newer
sequencing systems process them on-the-fly and delete them by default once
they have been analyzed to alleviate the burden of storing them. Oftentimes
it is easier and more economical to rerun the samples in case of data loss
rather than archiving these huge raw signal files.
Due to the huge size of most NGS files, transferring them from one place
to another is nontrivial. For a small-sized project to transfer sequencing
files from a production server to a local storage space, download via FTP
or HTTP might be adequate if a fast network connection is available. As for
network speed, a 1 Gbps network is essential, while a 10 Gbps network offers
improved performance for high-traffic conditions. When the network speed
is slow or the amount of data to be transferred is too large, the use of an
external hard drive might be the only option. When the data reaches the lab,
for fast local file reading, writing, and processing, they need to be stored in a
hard drive array inside a dedicated workstation or server.
For a production environment, such as an NGS core facility or a large
genome center, that generates NGS data for a large number of projects,
enterprise- level data storage system, such as Directly Attached Storage
(DAS), Storage Area Network (SAN), or Network Attached Storage (NAS), is
required to provide centralized data repositories with high reliability, access
speed, and security. To avoid accidental data loss, these data storage sys-
tems are usually backed up, mirrored, or synced to data servers distributed
at separate locations. For large-scale collaborative projects that involve mul-
tiple sites and petabytes to exabytes of data, the processes of data transfer
and sharing pose more challenges, which prompt the development of high-
capacity and high-performance platforms such as Globus.
Data sharing among collaborating groups creates additional technical issues
beyond those dealt with by individual labs. A centralized data repository
might be preferred over simple data replication at multiple sites to foster
effective collaboration and timely discussion. Associated with data shar-
ing also comes the issues of data access control, and privacy for data gen-
erated from patient-oriented studies. In a broader sense, NGS data sharing
with the entire life science community also increases the value of a research
project. For this reason, many journals enforce a data sharing policy that
requires deposition before publication of sequence read data and pro-
cessed data into a publicly accessible database (such as the National Center
for Biotechnology Information’s [NCBI] Sequence Read Archive [SRA] or
the European Nucleotide Archive [ENA]). To facilitate data interpretation
and potential meta-analysis, relevant information about such an experi-
ment must also be deposited with the data. Some organizations, such as
the Functional Genomics Data Society, have developed guidelines on what
information should be deposited with the data. For example, the Minimum
Computing Needs for NGS Data Management and Analysis 89
datasets, not just using those from computer simulation but also those from
real-world biological samples. In addition, almost all tools have adjustable
parameters, which should be set equivalently to facilitate performance com-
parison. Also in terms of performance, earlier NGS software usually does
not take advantage of high-performance parallel computing (more details on
parallel computer Chapter 14). To increase performance and take full advan-
tage of the multiple cores or nodes in an HPC system, more recent algorithms
tend to use threading or a message passing interface (MPI) to spread the
work across multiple processes. Therefore, when evaluating NGS tools, it
also helps to examine if these types of parallel processing are employed to
take advantage of the power of multicore computing architecture.
For bioinformaticians who deal with NGS data, on the other hand, the fol-
lowing skills and knowledge are expected:
Application-Specific
NGS Data Analysis
7
Transcriptomics by RNA-Seq
7.1 Principle of RNA-Seq
Transcriptomic analysis deals with the questions of which parts of the
genome are transcribed and how actively they are transcribed. In the past,
these questions were mostly answered with microarray, which is based on
hybridization of RNA samples to DNA probes that are specific to individual
gene-coding regions. With this hybridization-based approach, the repertoire
of hybridization probes, which are designed based on the current annotation
of the genome, determines what genes in the genome or which parts of the
genome are analyzed, and genomic regions that have no probe coverage are
invisible. A next-generation sequencing (NGS)-based approach, on the other
hand, does not depend on the current annotation of the genome. Because
it relies on sequencing of the entire RNA population, hence the term RNA-
Seq, this approach makes no assumption as to which parts of the genome
are transcribed. After sequencing, the generated reads are mapped to the
reference genome in order to search for their origin in the genome. The total
number of reads mapped to a particular genomic region represents the level
of transcriptional activity at the region. The more transcriptionally active a
genomic region is, the more copies of RNA transcripts it produces, and the
more reads it will generate. RNA-Seq data analysis is essentially based on
counting reads generated from different regions of the genome.
By counting the number of reads from transcripts and therefore being
digital in nature, RNA-Seq does not suffer from the problem of signal satu-
ration that is observed with microarrays at very high values. RNA-Seq also
offers a native capability to differentiate alternative splicing variants, which
is basically achieved by detecting reads that fall on different splice junctions.
Whereas some specially designed microarrays, like the Affymetrix Exon
Arrays, can be used to analyze alternative splicing events, standard micro
arrays usually cannot make distinctions between different splicing isoforms.
Also different from microarray signals, which are continuous, raw RNA-Seq
signals (i.e., read counts) are discrete. Because of this difference, distribution
model and methods of differential expression analysis designed for micro
array data cannot be directly applied to RNA-Seq data without modification.
97
98 Next-Generation Sequencing Data Analysis
7.2 Experimental Design
7.2.1 Factorial Design
Before carrying out an RNA-Seq experiment, the biological question to be
answered must be clear and well defined. This will guide experimental
design and subsequent experimental workflow from sample preparation
to data analysis. For experimental design, factorial design is usually used.
Many experiments compare the transcriptomic profile of two conditions,
for example, cancer versus normal cells. This is a straightforward design,
involving only one biological factor (i.e., cell type). Experiments involving a
single factor may also have more than two conditions, for example, compari-
son of samples collected from multiple tissues in the body in order to detect
tissue-specific gene expression.
If a second biological factor (e.g., treatment of a drug) is added to the exam-
ple of cancer versus normal cell comparison, the experiment will have a total
of four (2 × 2) groups of samples (Table 7.1). In this two-factor design, besides
detecting the effects of each individual factor—cell type and drug treatment—
the interacting effects between the two factors are also detected, for example,
drug treatment may have a larger effect on cancer cells than normal cells. If
the factors contain more conditions, there will be a total of m × n groups of
samples, with m and n representing the total number of conditions for each
factor. Experiments involving more than two factors, such as adding a time
factor to the aforementioned example to detect time-dependent drug effects
on the two cell types, are inherently more complex and therefore more chal-
lenging to interpret, because in this circumstance it is not easy to attribute
a particular gene expression change to a certain factor, or especially, to the
interaction of these factors due to the existence of multiple interactions (three
factors involve four different types of interactions).
TABLE 7.1
Experimental Design Involving Two Biological Factors
Cancer Cells Normal Cells
Drug treated Cancer + Drug Normal + Drug
Vehicle treated Cancer + Vehicle Normal + Vehicle
Transcriptomics by RNA-Seq 99
7.2.3 Sample Preparation
Since gene expression is highly plastic and varies greatly with internal (such
as tissue and cell type, developmental stage, circadian rhythm, etc.) and exter-
nal (such as environmental stress) conditions, samples should be collected in
a way that minimizes the effects of irrelevant factors. If the influence of such
factors cannot be totally avoided, they should be balanced across groups. As
many biological samples contain different cell types, this heterogeneity in
cell composition is another factor that may confound data interpretation. The
use of homogeneous target cells is preferred whenever possible, as this will
greatly improve data quality and experimental reproducibility.
To prepare samples for RNA-Seq, total RNA (or messenger RNA [mRNA])
is first extracted from samples of contrasting conditions. As ribosomal RNAs
(rRNAs) are usually the predominant but uninformative component in total
RNA extractions, rRNA species are usually depleted prior to sequencing.
Approaches for rRNA depletion include enrichment of eukaryotic mRNAs
that have poly(A) tails with poly(T) primer-based capturing; the Ribo-Zero
method based on hybridization and then removal of rRNAs with rRNA-
specific RNA probes; degradation by duplex-specific nuclease (DSN), which
relies on denaturation-reassociation kinetics to remove extremely abundant
RNA species including rRNAs [102]; and RNase H selective depletion based
on binding rRNAs to rRNA-specific DNA probes and then using RNase H to
digest bound rRNAs. Without rRNA depletion using one of these approaches,
signals from low-abundance mRNA transcripts might be masked.
Besides rRNA depletion, degradation of RNA molecules in an extracted
sample may also lead to artifactual results. To detect the intactness of RNA
molecules in samples, some quality metrics, such as the RNA integrity num-
ber (or RIN), are often used. It is recommended to use high-quality RNA
samples with no or low levels of degradation, as indicated by high RIN
scores, whenever possible. One prerequisite to extracting high-quality RNA
is to snap-freeze tissue samples whenever possible to avoid potential RNA
degradation. Under circumstances where this is not possible (e.g., sample col-
lection in the field), RNA stabilizing reagent (such as RNAlater) can be used.
100 Next-Generation Sequencing Data Analysis
For RNA samples prepared under certain circumstances such as those from
historical samples or formalin-fixed paraffin-embedded (FFPE) clinical tis-
sues, RNA degradation can be unavoidable. But even from highly degraded
RNA samples such as these, useful data may still be generated [103].
Other experimental factors may also have impacts on RNA-Seq data
generation and subsequent analysis. Contamination of RNA samples with
genomic DNA is one such factor. To remove DNA contaminants, DNase treat-
ment of extracted RNA samples is recommended. In addition, many RNA
extraction protocols do not retain small RNA species including microRNAs
(miRNAs). If these species are also of interest (more on small RNA sequenc-
ing in Chapter 8), alternative protocols (such as the TRIzol method) need
to be used. After RNA extraction, an RNA sequencing library needs to be
constructed, which basically involves reverse transcription to cDNA and
sequencing adapters ligation. This sequencing library construction process
may also introduce bias to the subsequent sequencing and data generation.
For example, using poly(T) oligonucleotides to enrich for mRNA or prime
reverse transcription during this process introduces 3′ end bias, as these pro-
cedures are based on the poly-A tail located at the 3′ end of the vast majority
of eukaryotic mRNAs. This bias precludes analysis of those mRNAs and
other noncoding RNAs that do not have this tail structure [104]. If these RNA
species are of interest, the use of alternative rRNA depletion methods and
random primers in the reverse transcription step can be employed.
7.2.4 Sequencing Strategy
To facilitate subsequent read alignment to identify their origins in the
genome, although single-end long reads will definitely help, use of paired-
end but shorter reads works equally well. Besides read length, how to arrange
samples on a sequencer in terms of lane assignment can also affect the out-
come of an RNA-Seq experiment. On sequencer lane assignment, a balanced
block design [105] should be used to minimize technical variation due to
lane-to-lane or flow cell-to-flow cell difference. In such a design, samples
from different conditions are multiplexed on the same lanes, instead of run-
ning different samples or conditions on separate lanes.
An often-asked question on the conduct of RNA-Seq is how many reads
should be obtained for an RNA-Seq experiment. The answer on this issue of
coverage depth is based on a number of factors, such as the size of the organ-
ism’s genome, the purpose of the study (quantification of low-abundance
genes and alternative splicing variants vs. quick survey of majorly expressed
genes), and ultimately statistical rigor (effect size and statistical power).
Some RNA-Seq power analysis tools, such as Scotty [106], can be used to
help decide on sequencing depth as well as sample size. To start on a spe-
cies or cell type that has not yet been studied, it might be useful to try out
a small number of samples first to get a general idea on the composition of
the target transcriptome and the variability between biological replicates.
Transcriptomics by RNA-Seq 101
In general, for human studies, 100 million reads (after filtering, see Section
7.3.1) are needed to detect ~80% of expressed genes; significantly more reads
(300 million) are needed, however, to find 80% of differentially expressed genes
between conditions. Studying alternative splicing requires more reads due
to the increased resolution. It is estimated that 150 million reads are needed
to detect 80% of splicing events and 400 million reads for finding 80% of
differential splicing events between conditions [107–109]. It should also be
mentioned that the detection power of an RNA-Seq study is not only affected
by sequencing depth but also the number of sample replicates. Sequencing
depth and sample replication provide estimation on gene expression varia-
tion at two different levels, with the former sampling RNA fragments in the
sequencing library (not every RNA fragment is sequenced) and the latter
sampling biological subjects. For projects on budget, it has been reported
that increasing the number of biological replicates is more effective in boost-
ing detection power than increasing sequencing depth [110].
that map within genes (including exons or introns), or intergenic, for those
that map to genomic space between genes.
If the species under study does not have a sequenced reference genome
against which to map RNA-Seq reads, two approaches exist. One is to map
the reads to a related species that has a reference genome, while the alterna-
tive is to assemble the target transcriptome de novo. The de novo assembly
approach is more computationally intensive, but it does not rely on reference
genomic sequence. Currently available de novo transcriptome assemblers
include Oases [128], SOAPdenovo-Trans [129], Trans-ABySS [130], and Trinity
[131]. These de novo assemblers are suited when no related species or only
very distantly related species with a reference genome exists, or the target
genome, despite the available reference sequence, is heavily fragmented or
altered (such as in tumor cells). It should also be noted that if a related refer-
ence genome exists with 85% or higher sequence similarity with the species
under study, mapping to the related genome may work equally well, or even
better, compared to the de novo assembly approach. This is especially true
when studying alternative splicing variants.
gi, j × SF
ei, j =
ai × l j
where ei,j is the normalized expression level of gene j in sample i, gi,j is the
number of reads mapped to the gene in the same sample, ai is the total num-
ber of mapped reads (depth) in sample i, and lj is the length of gene j. SF is
a scaling factor and equals to 109 when ei,j is presented as RPKM or FPKM
104 Next-Generation Sequencing Data Analysis
ensuring that all samples have the same empirical distribution. This method
is used by the package limma, originally designed for microarray data anal-
ysis and now revised for RNA-Seq data [133]. Among other normalization
methods are those that use a list of housekeeping genes or spike-in controls
as the normalization standard. The use of housekeeping genes or spike-
in controls is for conditions in which the assumption that the majority of
genes are not differentially expressed might be violated. In this approach,
a set of constitutively expressed housekeeping genes that are known to stay
unchanged in expression under the study conditions, or a panel of artificial
spike-in controls that mimic natural mRNA and are added to biological sam-
ples at known concentrations, is used as the basis against which other genes
are normalized. Additional methods include those that adjust for putative
bias associated with sample-specific GC content [134].
108 3
2
106
1
104
Variance 102
100
10−2
10−4
100 101 102 103 104 105
Mean
FIGURE 7.1
The overdispersion problem in RNA-Seq data. Poisson distribution is often used to model
RNA-Seq data, but instead of the variance/dispersion being approximately equal to the mean
as assumed by the distribution, the variance in RNA-Seq data is often dependent on the
mean. Line 1 represents the relationship between variance and mean based on the Poisson
distribution, and lines 2 and 3 (dashed) represent local regressions used by DESeq and edgeR,
respectively, based on negative binomial distribution. (Modified from S Anders, W Huber,
Differential expression analysis for sequence count data, Genome Biology 2010, 11:R106.)
a somewhat arbitrary cutoff for gene selection, the GSEA approach increases
sensitivity of the analysis and can pick up weaker signals that might be oth-
erwise missed. For gene network analysis, tools like Cytoscape or Ingenuity
Pathway Analysis (IPA, commercial) are often used. Gene network can be
reconstructed on the basis of currently available experimental evidence, or
coexpression patterns.
111
112 Next-Generation Sequencing Data Analysis
RNA sequencing data analysis shares much commonality with the analysis
of RNA-Seq data (Chapter 7). In the meantime, some aspects of small RNA
sequencing data analysis are unique and mostly focused on in this chapter.
Sequence
reads
miRNA-miRNA*
Pre-miRNA Loop duplex
Dicer *
Argonaute *
cleavage processing
and deep
sequencing
FIGURE 8.1
Deep sequencing of mature small RNAs after Dicer and Argonaute processing. Dicer cleaves a
short stem-loop structure out of pre-miRNA to form the miRNA:miRNA* duplex. Upon load-
ing into RISC, Argonaute unwinds the duplex and uses one strand as a guide for gene silencing
and discards the other strand (the star strand). Although the short stem-loop and star strand
sequences are usually degraded, they may still generate sequencing signals, because of unde-
graded residues or the fact that they may exist to perform other functions (e.g., the star strand
is sometimes functional).
Small RNA Sequencing 113
The small RNA sequencing library construction process starts with ligation
of adapter sequences to their 3′ and 5′ ends. The universal adapter sequences
provide anchoring for subsequent reverse transcription, and then PCR amplifi-
cation. In the polymerase chain reaction (PCR) step, the number of cycles should
be limited to less than 15 (even when the amount of starting material is limited),
otherwise library complexity may be reduced leading to a biased result. For
multiplexed sequencing, indexing sequences should also be incorporated dur-
ing the PCR step as part of the PCR primers. Alternatively incorporating index-
ing sequences during adapter ligation as part of adapter sequence has been
found to lead to serious ligation bias [167,168]. After the PCR step, size selection
is conducted to purify constructs that carry only small RNAs. Although the
library construction process may vary with different sequencing platforms in
technical detail such as the use of different adapters and PCR primers, this
general workflow is usually followed. It should also be noted that some biases,
sometimes unavoidable like in other NGS applications, can be introduced in the
library preparation process, for example, some miRNA sequences may be pref-
erentially captured over others, leading to sequence-specific biases [168–170].
Because of their short length, constructed small RNA sequencing libraries
do not need to be sequenced for very long. The actual read length depends on
the configuration of library constructs and whether the index sequences are
read in the same pass or as a separate reading step. In the current version of the
Illumina small RNA sequencing protocol that reads index sequences in a sec-
ond pass, 50 cycles of sequencing can be enough. Sequencing depth is another
key factor in the data generation process that determines the power of differ-
ential expression analysis and novel small RNA discovery. While this depends
on the sample source, as small RNA amount and composition vary greatly
with cell type and species, in general 4 million to 5 million raw unmapped
reads should offer enough confidence for most studies. A study has shown
that coverage higher than 5 million reads contributes little to the detection of
new small RNA species [171].
8.1.2 Preprocessing
After obtaining sequencing reads and demultiplexing (if the samples are
multiplexed), the reads generated from each sample need to be checked for
base call quality using the quality control (QC) tools introduced in Chapter 5
such as FastQC and FASTX-Toolkit. Because small RNA libraries are usually
sequenced longer than the actual lengths of the small RNA inserts, the 3′
adapter sequence is often part of the generated sequence reads and therefore
should also be trimmed off. The trimming can be carried out with standalone
tools such as Cutadapt and Trimmomatic, or utilities in the FASTX-Toolkit
and NGS QC Toolkit. Adapter trimming can also be conducted contempo-
raneously with mapping, as some mappers provide such an option, or using
data preprocessing modules within some small RNA data analysis tools (to
be covered next).
114 Next-Generation Sequencing Data Analysis
8.1.3 Mapping
For mapping small RNA sequencing reads to a reference genome, short read
aligners introduced in Chapter 5, such as Bowtie/Bowtie2, BWA, Novoalign,
or SOAP/SOAP2, can be used. Among these aligners, Novoalign offers the
option of stripping off adapter sequences in the mapping command. As for
the reference genome, the most recent assembly should always be used.
Because of the short target read length, the number of allowed mismatches
should be set as 1. To speed up the mapping process, a multithreading
parameter, which enables the use of multiple CPU cores, can be used if the
aligner supports it. After mapping, reads that are aligned to unique regions
are then searched against small RNA databases to establish their identities
(see Section 8.1.4), while those that are mapped to a large number (e.g., >5000)
of genomic locations should be removed from further analysis.
Besides the aforementioned general tools for small RNA reads preprocess-
ing and mapping, tools have also been developed specially for small RNA
analysis, including DSAP [172], miRanalyzer [173], miRDeep/ miRDeep2
[174], miRExpress [175], miRNAKey [176], and mirTools [177]. Among these
tools, miRanalyzer was among the first developed and is currently one of
the most widely used methods. It provides functions for data preprocess-
ing, including 3′ adapter sequence removal, and uses Bowtie for mapping.
Both miRDeep2 and mirTools also have modules for data preprocessing, and
mapping with the use of Bowtie and SOAP, respectively. DSAP, miRNAKey,
and miRExpress all have preprocessing functionalities, but instead of map-
ping to a reference genome, they map the reads to noncoding RNA (ncRNA)
databases including miRBase and Rfam [178]. Rfam is an annotated database
for ncRNA families with each family containing a series of RNA sequences
that share a common ancestor.
While the mapping of small RNA reads to a reference genome is similar
to the mapping in RNA-Seq, as covered in Chapter 7, some characteristics
of small RNAs, mostly their short length and posttranscriptional editing,
present different challenges from the small RNA reads mapping process.
Because of their short length, sizeable numbers of small RNA reads are usu-
ally mapped to more than one genomic region. In comparison, this issue is
minimal for RNA-Seq data, as longer and sometimes paired-end reads greatly
increase specificity. The easiest way to deal with multimapped small RNA
reads is to simply ignore them, but this leads to the loss of great amounts of
data. A more commonly used approach is to randomly assign them to one of
the mapped positions, while an alternative approach is to report them to all
possible positions. More sophisticated algorithms have also been developed
in an effort to avoid the precision or sensitivity pitfalls of these approaches.
For example, one package called Butter (Bowtie UTilizing iTerative placE-
ment of Repetitive small RNAs) [179] makes assignment to one of the pos-
sible positions based on the relative local densities of other more confidently
assigned reads.
Small RNA Sequencing 115
8.1.5 Normalization
Before identifying differentially expressed small RNAs, read counts for each
small RNA species in the samples need to be normalized. The goal of normal-
ization is to make the samples directly comparable by removing unwanted
sample-specific variations, which are usually due to differences in library
size and therefore sequencing depth. The normalization approaches used in
116 Next-Generation Sequencing Data Analysis
Once a list of potential target genes are generated, functional analysis, such
as Gene Ontology (GO) and pathway analysis, can be conducted using the
approaches detailed in Chapter 7. In addition, for pathway analysis, a list of
miRNAs can also be uploaded directly to the DIANA miRPath web server
to a generate a list of biological pathways that are significantly enriched with
the miRNAs’ target genes, which are predicted with DIANA-microT-CDS or
documented with existing experimental evidence [197].
9
Genotyping and Genomic Variation
Discovery by Whole Genome Resequencing
119
120 Next-Generation Sequencing Data Analysis
Read mapping
Local realignment
Variant calling
(SNVs, mutations, indels, and SVs)
Variant annotation
Association test
(between variants and diseases/phenotypes)
FIGURE 9.1
General workflow for genotyping and variation discovery from resequencing data.
C
T
T
C
T
C
C
FIGURE 9.2
The variant calling process is usually affected by various factors. In this illustration, a number
of reads are aligned against a reference sequence (bottom). At the illustrated site, the reference
sequence has a C, while the reads have C and T. Depending on the factors mentioned in the
text and prior information, this site can be called heterozygous (C/T), or no variation (C/C)
if the T’s are treated as errors. It is also possible to be called a homozygous T/T, if the C’s are
regarded as errors.
122 Next-Generation Sequencing Data Analysis
Reference
ACTCCCGTCGGAACGAATGCCACG
genome
ACTCCCGTCGGAACCAATGCC - - -
- CTCCCGTCGGAACCAATGCCACC
- - - CCCGTCGGAACCAATGCCACG
- - - - - CGTCGGAACCAATGCCACG
- - - - - CATCGGAACCAATGCCACC
Normal - - - - - - GTCGGAACCAATGCCACG
- - - - - - - - - - - - - - CAATGCCACC
- - - - - - - - - - - - - - - - - - - - CACC
Allelic counts
aN 122335566666660777778773
dN 122335666666667777778777
ACTCCCGTCGGAACCAATGCCACC
- - TCCCGTCGGAACCAATGCCACC
- - - CCCGTCGGAACCAATGCCACC
- - - - - - GTCGGCACCAATGCCACG
- - - - - - - - CGGCACCAATGCCACG
Tumor - - - - - - - - - - GCACCAATGCCACG
- - - - - - - - - - - - - - - AATGCCACG
- - - - - - - - - - - - - - - - - - - CCACG
aT 112333445563660777788883
Allelic counts
dT 112333445566666777788888
Germline
Somatic (AA,AB) (AB,BB) (AB,AB)
FIGURE 9.3
De novo somatic mutations versus inherited germline variations. In this example, sequence
reads from normal and tumor tissues are aligned to the reference genome (shown at the top).
The allelic counts, that is, the number of matches (aN and aT) and depth of reads (dN and dT), at
each base position are shown. The light gray sites indicate germline variants, while the dark
gray indicates a de novo somatic variant acquired in some tumor cells. Also shown at the bot-
tom are the predicted genotypes for the normal and tumor tissues. (Modified from A Roth,
J Ding, R Morin, A Crisan, G Ha, R Giuliany, M Hirst et al., JointSNVMix: A probabilistic
model for accurate detection of somatic mutations in normal/tumour paired next-generation
sequencing data, Bioinformatics 2012, 28 (7):907–913. With permission.)
124 Next-Generation Sequencing Data Analysis
carrying somatic mutations from the same patient) are independently aligned
to and variants called against a reference genome. The called variants in the
contrasting samples are then compared to each other to locate somatic muta-
tions in the cancer tissue. In the latter approach, the samples are directly
compared to each other using statistical tests on the basis of joint probability.
9.2.3 Indel Calling
Calling small indels (large indels are covered in Section 9.3 on SV calling),
which occur at a frequency of about 1 in 8000 bp in the genome, is more chal-
lenging than calling SNVs. This is because the existence of indels in a read
could interfere with the read’s accurate mapping. The mapping for small indel
calling, therefore, should allow insertions or deletions that involve a few bases.
After mapping, a simple approach to indel calling is to extract insertion and
deletion information from the sorted BAM file using Samtools (varFilter). This
approach, while simple, often shows high false-positive and false-negative
rates. More complex approaches, such as Dindel [209] or GATK, can be used for
improved performance. The basic workflow of these approaches is (1) scan the
input BAM file for insertions and deletions; (2) for each indel site, build a new
haplotype based on the indel event; (3) realign all sequence reads to the newly
created alternative haplotype; (4) count the number of reads that support the
indel in the alternative haplotype; and, finally, (5) make indel calls.
Another approach that addresses the challenge of calling indels is based
on local de novo assembly. SOAPindel [210] is an example of this approach.
With the use of paired-end reads, this approach first identifies mapped reads
that have unmapped mates, and then positions the unmapped reads at their
expected genomic locations. A local de novo assembly is subsequently built
with high density of such unmapped reads and aligned to the reference to
identify indels. This approach is computationally more intensive, especially
when the reference genome is large.
FIGURE 9.4
The VCF format (version 4.2). (From http://samtools.github.io/hts-specs/; the format is cur-
rently managed by the Global Alliance Data Working Group File Formats Task Team.)
126 Next-Generation Sequencing Data Analysis
TABLE 9.1
Mandatory Fields in a VCF File
Col Field Type Description
1 #CHROM String Chromosome number
2 POS Integer Start position of the variation
3 ID String Database identifier
4 REF String Reference allele
5 ALT String Alternate allele(s)
6 QUAL Numeric Quality score (Phred-style)
7 FILTER String Filter status
8 INFO String User extensible information
Paired reads
Insert
Reference
sequence
Insertion Deletion
P1 P1 P2 P2
P1 P2 P1 P2
Translocation Inversion
FIGURE 9.5
Common SVs and the basic approach to detect them using paired reads.
of NGS, especially the use of paired-end reads, has greatly pushed SV detec-
tion forward. As illustrated in Figure 9.5, the basic approach to locate large
indels, inversions, and translocations is based on changes in orientation or dis-
tance between paired reads. SV detection algorithms that employ this general
approach include BreakDancer [217], GASV [218], HYDRA [219], PEMer (Paired-
End Mapper) [220], and SVDetect [221]. Figure 9.6 shows the general algorithmic
procedure for calling SVs using this approach. The first step is to separate read
pairs into concordant and discordant groups, defined by the distance between a
read pair matching or deviating from the expected distance based on the refer-
ence genome. The discordant read pairs are then assembled into different clus-
ters based on the genomic region they cover to generate candidate SV calling
regions. In the last step, the candidate SV clusters are filtered based on statisti-
cal assessment so that only clusters that are covered by multiple read pairs are
reported as SVs. The bounds on possible breakpoints in the region are also iden-
tified in this step (indicated by the shaded area in Figure 9.6d).
(a) (b)
(c) (d)
FIGURE 9.6
General steps of calling SVs using paired-end reads. (a) Paired reads are mapped to the reference
genome. (b) Discordant read pairs are identified. (c) Discordant read pairs are assembled into
clusters. (d) Candidate clusters of discordant read pairs are filtered to identify SVs, and bounds
on possible breakpoints identified. (From C Whelan, Detecting and analyzing genomic struc-
tural variation using distributed computing, 2014, Scholar Archive, Paper 3482. With permission.)
128 Next-Generation Sequencing Data Analysis
9.3.2 Breakpoint Determination
Although the aforementioned read-pair based approach can be used to locate
most SV events (except multiple copy number duplications), they cannot be
used to locate exactly where the breakpoints are in the genome. This is due
to the fact that the distance between paired reads is dependent on the size of
the fragment of their origin, which is not exact even under the best experi-
mental conditions. To locate the breakpoints in these events, a split-read
based approach may be used, which locates breakpoints by splitting some
reads into subsequences that map to different genomic regions. Algorithms
that use this approach include CREST [222], Pindel [223], SplazerS [224], and
SRiC [225]. Pindel, for example, first searches for read pairs in which one
read aligns to the reference genome but the other does not. Based on the
assumption that the second read contains a breakpoint, it uses the aligned
read as anchor to scan the surrounding regions for split mapping of the sec-
ond read. Although it can locate breakpoints at single base resolution, this
approach is computationally expensive because of the challenge associated
with aligning read subsequences to different genomic regions with gaps in
between.
9.3.4 CNV Detection
Detection of variation in segmental copy numbers is usually conducted with
algorithms that detect abnormal changes in regional read frequency. These
algorithms are based on the assumption that the number of reads obtained
from a region is proportional to its copy number in the genome. If a genomic
segment is repeated multiple times, a significantly higher number of reads
will be observed from the segment compared to other nonrepeated regions.
If a segment is deleted, on the other hand, there will be no read coverage
for it. Examples of these algorithms include CNAseg [227], CNV-Seq [228],
CNVnator [229], Event-Wise Testing (EWT) [230], JointSLM [231], mrFAST,
and SegSeq [232]. As other factors, such as GC content, may also affect local
Genotyping and Genomic Variation Discovery 129
9.3.5 Integrated SV Analysis
The different software tools introduced earlier are usually tailored for detect-
ing particular types (or aspects) of SVs. In order to make calls for the full
range of SVs, there have been efforts to take an integrated approach toward
comprehensive SV calling using the different but often complementary tools.
SVMerge, being one of these efforts, integrates SV calling results from dif-
ferent callers [233]. It first feeds BAM files into a number of SV callers such
as those introduced earlier to generate BED files, and then the SV calls in the
BED files are merged. A comprehensive list of SVs is generated after com-
putational validation with breakpoints refined by local de novo alignment.
Other efforts that take a similarly integrated approach include GASVPro
[234], SVSeq [235], and CNVer [236].
Sanger sequencing was considered the golden standard for de novo genome
assembly. However, it is prohibitively expensive and time-consuming
to assemble a genome using this first-generation technology, as it took
$3 billion and 13 years to generate the human genome draft assembly. The
demand for low-cost and fast genome sequencing provides the very impe-
tus for the development of next-generation sequencing (NGS) technolo-
gies. The dramatically reduced cost of NGS makes whole-genome shotgun
sequencing much more affordable and accessible to individual labs. De novo
genome assembly from the relatively short and enormous number of reads
generated from most NGS platforms, however, poses serious challenges to
assembling algorithms that were designed for Sanger sequences. The short
length of NGS reads means that they carry less information and as a result
lead to more uncertainties in the assembling process. To remedy this situa-
tion, higher coverage is required, which significantly increases the number
of reads required and therefore the computational complexity. For example,
using Sanger sequences with lengths up to 800 bp, assembling the human
genome used approximately 8× coverage; for NGS reads of 35 to 100 bp, the
same task needs 50× to 100× coverage [244].
Since Sanger sequence assemblers cannot deal effectively with these chal-
lenges, new de novo genome assemblers have been developed for NGS data.
The development of Velvet [245] and ABySS [246] in 2008 and 2009 showed
that de novo high-quality genome assembly can be achieved, even for large
genomes, using massive numbers of ultrashort (as short as 30 bp) reads. The
first de novo assembly of a human genome with the use of only short NGS
reads was accomplished in 2010 with the development of SOAPdenovo [96].
With the recent rapid algorithmic developments in this direction, along with
the gradual increase in read length, de novo genome assembly from NGS
sequences has been becoming more and more robust.
131
132 Next-Generation Sequencing Data Analysis
is first circularized to have the two ends joined. This circular DNA is then
fragmented, and the segment that contains the junction of the two ends is
selected and sequenced with paired-end sequencing. To span repetitive
regions in different sizes, sequencing reads generated from mate-pair librar-
ies of varying insert sizes (e.g., from 2 to 40 Kb), as well as regular paired-end
reads, are often used [247,248].
The combined use of paired-end and mate-pair libraries of different
insert sizes is a key strategy in assembling a genome from NGS reads. The
paired-end sequencing generates reads at the shorter size range (e.g., 180 bp)
for assembling of nonrepeat sequences as well as resolving short repeat
sequences, whereas the mate-pair jump sequencing produces reads at the
larger size range for resolving intermediate and long-range repeat regions
and fill the corresponding gaps. Gaps of substantial sizes that are beyond the
covering range of mate-pair libraries cannot be filled.
Besides the use of paired-end and mate-pair sequencing, read length is
also a key parameter for de novo genome assembly. Although mammalian
genomes have been assembled from reads shorter than 75 bp [96,248], lon-
ger reads are always better. To obtain long reads, some sequencing plat-
forms, such as that from Pacific Biosciences, can be used. To balance read
length with cost and error rate, NGS systems that usually do not read long
sequences, such as the Illumina system, can also be used ingeniously to pro-
duce longer reads. For example, the current Illumina system can sequence
250 nucleotides from one end using the rapid run mode. If using this mode
to conduct paired-end sequencing on libraries that contain DNA inserts of
450 bp, each generated read pair will overlap, and with software they can be
merged to form a single long read covering the entire length of 450 bp. This
strategy, combined with the use of mate-pair libraries of different sizes, or
different sequencing technologies, can to a large degree overcome the limita-
tion imposed by a short read length. Read length will become a lesser issue
with the emergence of long-read NGS technologies.
Sequencing depth is another important factor to consider for a de novo
assembly project. While it varies by project and is dependent on the other
factors (including the amount of repeats and level of heterozygosity in the
genome as well as read length and error rate), a coverage that is too low
will undoubtedly result in a highly fragmented assembly. As a rough guide,
in the combined use of paired-end and mate-pair libraries of various insert
sizes, 45× to 50× coverage is needed for the short-insert-size paired-end
and intermediate-size (3 to 10 Kb) mate-pair libraries, and 1× to 5× cover-
age for the long-insert (10 to 40 Kb) mate-pair libraries [249,250]. It should
also be noted that while higher coverage may lead to improvement in the
final assembly quality, additional increase in coverage also means increased
data volume, computational complexity, and processing time. There are also
studies showing that beyond a certain level of coverage, further increase in
sequencing depth does not necessarily lead to an increase in assembly qual-
ity in terms of the size of assembled contigs [96].
134 Next-Generation Sequencing Data Analysis
10.2 Assembly of Contigs
10.2.1 Sequence Data Preprocessing, Error Correction,
and Assessment of Genome Characteristics
The de novo assembly of a genome from NGS reads is a multistep process
(Figure 10.1). As the first step, sequence data quality needs to be inspected.
The data quality control (QC) steps described in Chapter 5 can be performed
here to examine per-base error rate, quality score distribution, read size dis-
tribution, contamination of adaptor sequences, and so on. Low-quality reads
need to be filtered out, and portions of reads that contain low-quality base
calls (usually the 3′ end), ambiguities (reported as N’s), or adaptor sequences
should be trimmed off. As part of data preprocessing, paired-end reads with
part of their sequences overlapped need to be merged to generate longer reads.
The read merging can also correct errors if discrepancy at some base positions
are observed, in which case the higher quality base call is used. The merging
process can be handled by tools such as FLASH [251] or PANDAseq [252].
Sequencing error correction is an important step for de novo read assem-
bly, more so than for most other NGS applications due to the fact that the
Contig assembly
(greedy, OLC, and de Bruijn approaches)
Scaffold construction
Genome assembly
quality evaluation
Gap closure
FIGURE 10.1
General workflow for de novo genome assembly.
De Novo Genome Assembly from NGS Reads 135
assembly process is much more sensitive to these errors. The data QC mea-
sures mentioned earlier cannot totally remove sequencing errors, as high
base-call quality scores alone cannot guarantee a read is free of sequenc-
ing errors. If left uncorrected, the errors will lead to prolonged computa-
tional time, erroneous contigs, and low-quality genome assembly. While
it can be time consuming, an additional error correction step can improve
final assembly quality. There are multiple options to carry out this step. For
example, the Quake error corrector [253] can be used as a standalone tool,
while some assemblers (see Section 10.2.2) have their own error correction
modules, such as ALLPATHS-LG [254]. Most error correction algorithms are
based on k-mer filtering [255]. K-mer refers to all the possible subsequences
of length k in a read, and breaking reads to k-mers makes the complicated
task of genome assembly more tractable. When all reads are converted to
k-mers, most k-mers in the pool are represented multiple times. Having a
k-mer that appears only once or twice is an indication of sequencing error
(Figure 10.2). The general error correction approach is to find the smallest
number of base changes to make all k-mers contained in a read “strong,”
that is, with the frequency of these k-mers from all reads above a thresh-
old level. To determine the appropriate threshold level for error correction,
the distribution of the frequency of k-mers can be plotted using data from
a k-mer counting software such as Jellyfish [256]. From the distribution pat-
tern, the size of the to-be-assembled genome, as well as coverage, can also
be estimated. For example, tools such as Kmergenie [257], SGA [258], and
VelvetOptimiser [259] provide reports on genome size and coverage from
the k-mer distribution pattern. Some of these tools, like SGA, also report on
other characteristics of the genome, such as repeat content and the level of
heterozygosity.
0.015
Error k-mers
0.010
True k-mers
Density
0.005
0.000
0 20 40 60 80 100
Coverage
FIGURE 10.2
The coverage profile of true k-mers and those with sequencing errors. (From DR Kelley, MS
Schatz, SL Salzberg, Quake: Quality-aware detection and correction of sequencing errors,
Genome Biology 2010, 11:R116. Used under the terms of the Creative Commons Attribution
License, http://creativecommons.org/licenses/by/2.0, © 2010 Kelley et al.)
136 Next-Generation Sequencing Data Analysis
(a) (b)
FIGURE 10.3
Comparison of the (a) OLC and the (b) de Bruijn graph approaches for global de novo genome
assembly. In the OLC example, six sequence reads (R1–R6) are shown for the illustrated
genomic region, with each read being 10 bp in length and the overlap between them set at
≥5 bp. The reads are laid out in order based on how they overlap. The OLC graph is shown at
the bottom, with many nodes having more than one incoming or outgoing connection. In the
de Bruijn graph example, the reads are cut into a series of k-mers (k = 5). In total there are 16
such k-mers, many of which occur in more than one read. The k-mers are arranged sequen-
tially based on how they overlap, and the de Bruijn graph built from this approach is shown
at the bottom. Different from those in the OLC graph, the majority of the nodes in this graph
have only one incoming and one outgoing connection. (From Z Li, Y Chen, D Mu, J Yuan, Y Shi,
H Zhang, J Gan, et al., Comparison of the two major classes of assembly algorithms: Overlap–
layout–consensus and de-Bruijn-graph, Briefings in Functional Genomics 2012, 11(1):25–37. With
permission.)
10.3 Scaffolding
After assembly of contigs, the next step is to organize the contigs into a “scaf-
fold” structure to improve continuity rather than leave them disjointed. This
scaffolding process orders and orients the contigs, and estimates the lengths
of the gaps between them (Figure 10.4). To establish positional relationship
between contigs, scaffolding algorithms use mate-pair reads that span dif-
ferent contigs.
For input, scaffolding algorithms take preassembled contigs, mate-pair
reads, and sometimes long reads generated from other sequencing technolo-
gies (such as 454 or PacBio). The first and also an important step in the scaf-
folding process is to map the input read pairs or long reads to the contigs. To
improve mapping results, sequencing errors in the reads should be corrected
prior to mapping. To assemble the contigs into scaffolds using the guiding
information in the mate-pair or long reads, scaffolders usually take a graph-
based approach similar to the contig assembly process, but here with contigs
as nodes and connecting read pairs (or long reads) as edges. The quality
Scaffold
Contig 1 Contig 2
DNA fragments
Paired-end read
Approximate length, but no known sequence
FIGURE 10.4
Assembling contigs into a scaffold.
De Novo Genome Assembly from NGS Reads 139
10.5 Gap Closure
The final stage of finishing a de novo genome assembly is to close the gaps
between contigs. A standard approach to achieving this is to employ PCR,
first to amplify the gapped regions using primers specific to the ends of the
two contigs bordering the gaps, followed by sequencing of the amplicons.
If the number of gaps is high, however, this approach can be laborious and
expensive. Alternatively, gap filling software, such as IMAGE [283], GapFiller
[284], or gap filling modules in some assemblers (such as SOAPdenovo) can
be used to close the gaps using paired reads generated from the gapped
regions. For example, IMAGE uses a targeted reassembly process in the gap
region to create new contigs to gradually fill the entire gap (Figure 10.5). It
first collects read pairs that align to contig ends and uses them to create new
contigs that extend into the gap region. After incorporating the new contigs
into the scaffold, this process is reiterated until the entire gap is filled.
Merged contig
FIGURE 10.5
Gap closing with the IMAGE process. (From IJ Tsai, TD Otto, M Berriman, Improving draft
assemblies by iterative mapping and assembly of short reads to eliminate gaps, Genome Biology
2010, 11:R41. Used under the terms of the Creative Commons Attribution License, http://
creativecommons.org/licenses/by/2.0, © 2010 Tsai et al.)
11.1 Principle of ChIP-Seq
Without the involvement of DNA-interacting proteins, the information coded
in DNA could not be accessed, transcribed, and maintained. Besides a large
number of transcription factors and coactivators, key DNA-interacting pro-
teins include histones, DNA and RNA polymerases, and enzymes for DNA
repair and modification (e.g., methylation). Through their DNA-interacting
domains, such as helix-turn-helix, zinc finger, and leucine zipper domains,
these proteins interact with their DNA targets by hydrogen bonding, hydro-
phobic interactions, or base stacking. Because the intimate relationship
between DNA and these proteins plays an important role in the function-
ing of the genome, studying how proteins and DNA interact and where
DNA-interacting proteins bind across the genome provides key insights into
the many roles these proteins play in various aspects of genomic function,
including information exposure, transcription, and maintenance.
ChIP-Seq is a next-generation sequencing (NGS)-based technology to
locate binding sites of a DNA-interacting protein in the genome. An exem-
plary scenario for using ChIP-Seq is to study transcription factor binding
profiles in the genome under different conditions, such as development
stages or pathological conditions. To achieve this, the protein of interest is
first cross-linked covalently to DNA in cells with a chemical agent, usually
formaldehyde (Figure 11.1). Then the cells are disrupted, and subsequently
sonicated or enzymatically digested to shear chromatin into fragments that
contain 100 to 300 bp DNA, followed by enrichment of the target protein with
its bound DNA by immunoprecipitation using an antibody specific for the
protein. Subsequently, the enriched protein-DNA complex is dissociated by
reversing the cross-links previously formed between them, and the released
DNA fragments are subjected to NGS. One key experimental factor in the
ChIP-Seq process is the quality of the antibody used in the enrichment step,
as the use of a poor-quality antibody can lead to high experimental noise due
to nonspecific precipitation of DNA fragments.
143
144 Next-Generation Sequencing Data Analysis
Cross-link
and
shear DNA-protein
complexes
DNA with
interacting proteins
1 2
G Immuno-
G precipitate
A
C
G
T
T
C
DNA Purify
sequencing
4
3
Map reads to
reference genome
G G A C G
A C G TT C G T
G G A C G T T C Tag distribution
G A C G T
5
FIGURE 11.1
The basic steps of ChIP-Seq. (From AM Szalkowski, CD Schmid, Rapid innovation in ChIP-seq
peak-calling algorithms is outdistancing benchmarking efforts, Briefings in Bioinformatics 2011,
12(6):626–633. With permission.)
Mapping Protein–DNA Interactions with ChIP-Seq 145
Based on the size of the region(s) that they bind, DNA-interacting proteins
can be divided into three groups:
11.2 Experimental Design
11.2.1 Experimental Control
Appropriate control for a ChIP-Seq experiment is the key to account for arti-
facts or biases that might be introduced into the experimental process. These
artifacts and biases may include potential antibody cross-reactivity with
nonspecific protein factors, higher signal from open chromatin regions (since
they are easier to fragment than closed regions [287]), and uneven sequenc-
ing of captured genomic regions due to variations in base composition. Two
major types of controls are usually set up for ChIP-Seq signal adjustment.
One is input control, that is, chromatins extracted from cells or tissues,
which are subjected to the same cross-linking and fragmentation procedure
but without the immunoprecipitation process. The other is “mock” control,
which is processed by the same procedure including immunoprecipitation;
the immunoprecipitation, however, is carried out using an irrelevant anti-
body (e.g., IgG). While it may seem to serve as the better control between
the two, the mock control often produces much less DNA for sequencing
than real experimental ChIP samples. Although sequencing can be carried
out on amplified DNA in this circumstance, the amplification process adds
additional artifacts and bias to the sequencing data, which justifies the use of
input DNA as the experimental control in many cases.
11.2.2 Sequencing Depth
How many reads to obtain for a ChIP-Seq experiment depends on the size
of the genome and how many binding sites the protein of interest has in the
genome. A good indication of having reached sufficient sequence depth is
when the number of protein binding sites reaches plateau with the increasing
146 Next-Generation Sequencing Data Analysis
11.2.3 Replication
To examine the reproducibility of a ChIP-Seq experiment and to reduce the
false discovery rate (FDR), replicate samples should be used. If a protein of
interest binds to regions of the genome with high affinity, the bound regions
should be identified in replicate samples. Regions that are not identified in
replicates are most possibly due to experimental noise. The Pearson correla-
tion coefficient (PCC) between biological replicates serves as a measurement
of experimental reproducibility, and the irreproducible discovery rate (IDR)
is another such metric. The calculation and usage of the PCC and IDR will be
detailed later in this chapter.
Experimental design
(controls, sequencing depth, replication)
Read mapping
Peak calling
Peak visualization
Functional analysis
(peaks assigned to nearby genes)
FIGURE 11.2
Basic ChIP-Seq data analysis workflow.
sensitivity but at the expense of higher FDR, while excluding them improves
specificity but at the risk of losing true signals. The choice for their inclusion
or not, therefore, is dependent on whether sensitivity or specificity is a prior-
ity. Independent of whether multireads are used, the percentage of uniquely
mapped reads reported from the mappers is indicative of data quality. If this
value is below 50%, it may indicate a potential problem with the experimen-
tal procedure and caution should be used in the interpretation of the data.
ChIP-Seq involving proteins that bind to repetitive regions of the genome
may also generate a low percentage of uniquely mapped reads.
For ChIP-Seq reads mapping, it is also worth mentioning that ChIP is an
enrichment, not purification, of protein-bound DNA sequences. As a result,
more reads are usually generated from background noise than from bound
regions. The background noise can be determined empirically with the use
of control samples. The distribution of observed background noise is not ran-
dom as many would expect (Figure 11.3). Instead, it is affected by the den-
sity of mappable reads in different genomic regions and the local chromatin
structure (e.g., as previously mentioned, an open chromatin structure gener-
ates more background reads). True binding signals in ChIP-Seq samples are
100
80
60
* * *
40
20
Pol II ChIP-Seq
0
100
80
60
40
20
Input DNA
0
100
80
60
40 Interferon-γ–stimulated STAT1 ChIP-Seq
20
0
100
80
60
40
20
Interferon-γ–stimulated input DNA
0
1
0.8
0.6
0.4
0.2
0
Mappable bases (1 Kb)
XPNPEP3 RBX1 EP300
0.000 39,500.000 39,520.000 39,540.000 39,560.000 39,580.000 39,600.000 39,620.000 39,640.000 39,660.000 39,680.000 39,700.000 39,720.000 39,740.000 39,760.000 39,780.000 39,800.000 39,820.000 39,840.000 39,860.000 39,880.000
SLC25A17 ST13
FIGURE 11.3
Background noise and signal profiles in a ChIP-Seq experiment. Shown here is the density
of mapped reads in one region of the human chromosome 22 for RNA polymerase II and the
transcription factor STAT1 (tracks 1 and 3, count from the top), respectively. Genes coded by
the two DNA strands in this region are displayed at the bottom. Tracks 2 and 4 show the dis-
tribution of mapped reads for the respective input DNA controls for the two proteins. It should
be noted that some of the peaks in the protein tracks are also present in their input controls.
Track 5 displays the fraction of uniquely mappable bases. (From J Rozowsky, G Euskirchen,
RK Auerbach, ZD Zhang, T Gibson, R Bjornson, N Carriero et al., PeakSeq enables systematic
scoring of ChIP-seq experiments relative to controls, Nature Biotechnology 2009, 27:66–75. With
permission.)
Mapping Protein–DNA Interactions with ChIP-Seq 149
11.3.2 Peak Calling
Peak calling, the process of finding regions of the genome to which the pro-
tein of interest binds, is a key step in ChIP-Seq data analysis. It is basically
achieved through locating regions where reads are mapped at levels signifi-
cantly above the background. The simplest way of peak calling is to count
the total number of reads mapped along the genome and call each location
with the number of mapped reads over a threshold as a peak. Due to the
inherent complexity in ChIP-Seq signal generation, including uneven back-
ground noise and other confounding experimental factors, this approach is
overly simplistic. Among the experimental factors, the way the immunopre-
cipitated DNA fragments are sequenced on most platforms has a direct influ-
ence on how peaks are called. Since the reads are usually short, only one end
or both ends of a fragment, instead of the entire fragment, are sequenced.
To locate a target protein’s binding regions, which are represented by the
immunoprecipitated DNA fragments and not just the generated reads, peak
calling algorithms need to either extend or shift the reads to cover the actual
binding areas. For example, PeakSeq extends each mapped read in the 3′
150 Next-Generation Sequencing Data Analysis
Tag counts
0.06
Tag density
−20 0.04
−5
Crosslink 0.02
−40
Fragmentation
FIGURE 11.4
The distribution of ChIP-Seq reads around the actual binding region and their positional shift on the two DNA strands. (a) How ChIP-Seq reads are
Mapping Protein–DNA Interactions with ChIP-Seq
produced from cross-linked and fragmented DNA. The cross-linking between the protein and the bound DNA can occur at different sites, as does
the fragmentation of the DNA. Each fragment is read at its 5′ end (indicated by the squares). These reads, serving as sequence tags of each fragment,
are clustered around the actual binding region from the two sides depending on which strand they come from. The dashed line depicts a fragment
from a long cross-link. (b) The distribution of sequence tag signal around the binding region. Vertical lines represent counts of sequence tags whose
5′ end maps to each nucleotide position on the positive and negative strands (displayed as positive and negative values, respectively). The solid curves
represent smoothed tag density along each strand. Since the two curves approach the binding site from the two sides, there is a gap between their
peaks. (c) Strand cross-correlation associated with shifting the strands across the gap. Before shifting the strands, the Pearson correlation coefficient is
calculated between the tag density of the two strands. When sequence tags mapped to the two strands are shifted relative to each other (shown on the
x-axis), the Pearson correlation coefficient gradually changes (y-axis). The dashed line at x = 0 corresponds to the strand cross-correlation before the shift,
while the one at the peak corresponds to the highest cross-correlation coefficient at the strand shift that equals to the average length of the DNA frag-
ments. (From PV Kharchenko, MY Tolstorukov, PJ Park, Design and analysis of ChIP-seq experiments for DNA-binding proteins, Nature Biotechnology
2008, 26:1351–1359. With permission.)
151
152 Next-Generation Sequencing Data Analysis
0.0400.045
Cross-correlation
0.035 0.030
Cross-correlation
Cross-correlation
Cross-correlation
cc(read_length)
min(cc)
0 100 200 300 400 0 100 200 300 400 0 100 200 300 400
Strand shift Strand shift Strand shift
cc(fragment length) cc(fragment length)−min(cc)
NSC = RSC =
min(cc) cc(read length)−min(cc)
FIGURE 11.5
The “phantom” peak and its use in determining ChIP-Seq data quality. The phantom peak
corresponds to the cross-correlation at the strand shift that equals to the read length, while
the ChIP peak corresponds to the cross-correlation at the shift of the average DNA fragment
length. A successful run is characterized by the existence of a predominant ChIP peak and a
much weaker phantom peak. In marginally passed or failed runs, the former diminishes while
the latter relatively becomes much stronger. (Adapted from SG Landt, GK Marinov, A Kundaje,
P Kheradpour, F Pauli, S Batzoglou, BE Bernstein et al., ChIP-seq guidelines and practices of the
ENCODE and modENCODE consortia, Genome Research 2012, 22(9):1813–1831. With permis-
sion. ©2012 Cold Spring Harbor Laboratory Press.)
After peak calling, artifactual peaks need to be filtered out, including those
that contain only one or a few reads that are most possibly due to PCR arti-
facts, or those that involve significantly imbalanced numbers of reads on the
two strands (see Figure 11.6).
For implementation of this peak calling process, different peak callers use
different methods, which can lead to differences in final outcomes. Table 11.1
shows some of the currently available peak callers. PeakSeq, MACS/MACS2,
HOMER (findPeaks module) [302], and SPP [300] are among some of the pop-
ular ones. As previously mentioned, the peak calling employed in PeakSeq
is a two-pass process. In the second pass, peaks are called by scoring reads-
enriched target regions based on calculation of the fold enrichment in the
Mapping Protein–DNA Interactions with ChIP-Seq 153
Tag count
Position (bp) Position (bp)
signal
Enrichment relative
to background
Position (bp)
Assess significance
Filter artifacts
P(s)
sthresh Tags
S Position (bp)
FIGURE 11.6
Basic substeps of calling peaks from ChIP-Seq data. The P(s) at bottom left signifies the prob-
ability of observing a location covered by S mapped reads, and the sthresh marks the thresh-
old for calling a peak significant. Bottom right shows two types of possible artifactual peaks:
single strand peaks and those based on mostly duplicate reads. (From S Pepke, B Wold, A
Mortazavi, Computation for ChIP-seq and RNA-seq studies, Nature Methods 6, 2009:S22–S32.
With permission.)
154 Next-Generation Sequencing Data Analysis
TABLE 11.1
ChIP-Seq Peak Calling Algorithms
Name Description Reference
CCAT Designed to identify weak ChIP signals 303
CisGenome Features multifaceted interactive analysis and 294
customized batch-mode computation
E-RANGE A Python package for both ChIP-Seq and RNA-Seq data 304
analysis
F-Seq Generates continuous genomic sequence density data 305
for easier visualization and interpretation
GLITR Uses classification to identify regions that have peak 306
height and fold-change not resembling those in control
HOMER Identifies peaks based on the principle that more 302
(findPeaks sequencing reads are found in these regions than
module) expected by chance
MACS/MACS2 Empirically models ChIP-Seq read length to improve 293
peak prediction, uses a dynamic Poisson distribution
PeakSeq Based on a two-pass strategy to compensate for open 292
chromatin signal
PeakRanger Uses a staged algorithm to discover enriched regions 307
and the summits within them
QuEST A statistical framework based on kernel density 308
estimation
RSEG Especially developed for locating genomic regions 309
associated with histone marks
SICER Uses a clustering approach to identify enriched 310
domains from histone modification ChIP-Seq data
SiSSRs Uses the direction and density of reads and the average 311
DNA fragment length to identify binding sites
SPP Includes binding profile normalization, peak detection, 300
and estimation of read depth to achieve peak saturation
USeq Empirical algorithms to reduce false positives and 312
estimate confidence in ChIP-Seq peaks
ZINBA Models and accounts for factors covarying with 313
background or true signals
ChIP-Seq sample versus the control, and the statistical significance associ-
ated with each enriched target region is calculated from binomial distribu-
tion. The MACS/MACS2 method was one of the earliest developed methods
and a good overall peak caller. It reduces analysis bias through the use of
control data and local statistics and generates an empirical FDR. The find-
Peaks module in HOMER identifies peaks based on the principle that more
sequencing reads are found in these regions than expected by chance. SPP
is an R package designed for analyzing Illumina-generated ChIP-Seq data. It
calculates a smoothed read enrichment profile along the genome and identi-
fies significantly enriched sites compared to input control.
Mapping Protein–DNA Interactions with ChIP-Seq 155
0.5
Score of replicate 1
Irreproducible
signals
0.4
100
IDR
0.3
0.2
50
Reproducible
0.1
signals
0.0
0
FIGURE 11.7
Use of irreproducible discovery rate (IDR) in assessing replicate reproducibility. (a) The distri-
bution of the significance scores of the peaks identified in two replicate experiments. The IDR
method computes the probability of being irreproducible for each peak, and classifies them as
being reproducible (black) or irreproducible (gray). (b) The IDR at different rank thresholds
when the peaks are sorted by the original significance score. (From T Bailey, P Krajewski,
I Ladunga, C Lefebvre, Q Li, T Liu, P Madrigal, C Taslim, J Zhang, Practical guidelines for the
comprehensive analysis of chip-seq data, PLoS Computational Biology 9, 2013:e1003326. Used
under the terms of the Creative Commons Attribution License, http://creativecommons.org
/licenses/by/3.0/, ©2013 Bailey et al.)
156 Next-Generation Sequencing Data Analysis
0.5
0.4
0.3
IDR
0.2
0.1
0
0
00
0
00
00
00
50
25
10
15
20
Number of peaks
FIGURE 11.8
Evaluation of the performance of six peak callers using IDR. (Original study from Y Chen, N
Negre, Q Li, JO Mieczkowska, M Slattery, T Liu, Y Zhang et al., Systematic evaluation of factors
influencing ChIP-seq fidelity, Nature Methods 9, 2012:609–614. With permission.)
MACS as the peak caller using default parameters. As they can vary with the
use of different peak callers and parameters, FRiP values must be derived
from the same algorithm using same parameters in order for them to be
comparable across samples or experiments.
11.3.3 Peak Visualization
Visualizing peaks in their genomic context allows identification of overlap-
ping or nearby functional elements, and thereby facilitates peak annotation
and data interpretation. Many peak callers generate BED files containing
peak chromosomal locations, along with WIG and bedGraph track files,
all of which can be uploaded to a genome browser for peak visualization.
Examination of peak regions in a genome browser and comparison with other
data/annotation tracks allow identification of associated genomic features,
such as promoters, enhancers, and other regulatory regions. BEDTools can
also be applied to explore relationships between peaks and other genomic
landmarks such as nearby protein-coding or noncoding genes.
Xi, j
Zi, j =
∑
N
Xi, j
j=1
where Zi,j and Xi,j are the normalized and original peak signal for sample i
and peak j, and N is the total number of called peaks.
158 Next-Generation Sequencing Data Analysis
TABLE 11.2
Packages Developed for ChIP-Seq Differential Binding Analysis
Name Description Reference
ChIPComp Differential binding analysis taking into consideration controls, 317
signal-to-noise ratios, replicates, and multifactor experimental
design
ChIPDIff Differential histone mark analysis based on Hidden Markov model 318
ChIPnorm Carries out quantile normalization for differential binding sites 298
identification
ChromaSig Performs unsupervised learning to determine significant patterns 319
of chromatin modifications across multiple experiments
DBChIP Identifies differentially bound punctate binding sites in multiple 320
conditions using RNA-Seq differential expression approaches
and accommodates controls
DiffBind Uses statistical tests used in RNA-Seq packages edgeR and DESeq 321
to process peak sets and identify differentially bound regions
diffReps Detects and annotates differential chromatin modification hotspots 295
DIME Differential binding analysis using a finite exponential-normal 322
mixture model
MAnorm Conducts an MA-plot-based normalization prior to quantitative 297
comparison
MMDiff Takes a multivariate nonparametric approach to testing differential 323
binding
Mapping Protein–DNA Interactions with ChIP-Seq 159
packages are designed on certain assumptions and therefore the user needs
to be aware of these assumptions and ensure they are fulfilled prior to using
them. For example, DIME is based on the assumption that a large proportion
of peaks are common across the conditions under comparison.
11.5 Functional Analysis
Often the data gathered from a ChIP-Seq study is used to understand gene
expression regulation and associated biological functions. To conduct func-
tional analysis, peaks are first assigned to nearby genes. While it is debatable
on what genes a peak should be assigned to, a straightforward approach
is to assign it to the closest gene. Once peaks are assigned to target genes,
an integrated analysis of ChIP-Seq and gene expression data (more on this
later) can be carried out. Furthermore, Gene Ontology (GO), biological path-
way, gene network, or gene set enrichment analyses can be conducted using
similar approaches as described in Chapter 7. Prior to carrying out these
gene functional analyses, one should also bear in mind that the peak-to-gene
assignment process is biased by gene size, as the presence of peak(s) has a
positive correlation with the length of a gene. In addition, the distribution
of gene size in different functional annotations such as GO categories is not
uniform, with some categories having an excess number of long genes and
others having more short genes. To solve the problems caused by different
gene size, methods that adjust for the effects of gene size should be used,
such as ChIP-Enrich [324].
11.6 Motif Analysis
One of the goals of ChIP-Seq data analysis is to identify DNA-binding motifs
for the protein of interest. A DNA-binding motif is usually represented by
a consensus sequence, or more accurately, a position-specific frequency
matrix. Figure 11.9a shows an example of such a DNA-binding motif, the one
bound by a previously introduced transcription factor NRF2 (see Chapter 2).
To identify motifs from ChIP-Seq data, all peak sequences need to be assem-
bled and fed into multiple motif discovery tools. Some of the commonly
used motif discovery tools are Cistrome [325], Gibbs motif sampler (part of
CisGenome), HOMER (findMotifs module), MEME-ChIP [326], QuEST [308],
RSAT peak-motifs [327], and ChIPMunk [328]. The motif discovery phase
usually ends up with one or more motifs, with one being the binding site
of the target protein and others being that of its partners. The discovered
160 Next-Generation Sequencing Data Analysis
Consensus
2.0
Bits
1.0
0.0
(a)
De novo motif
2.0
Bits
1.0
0.0
(b)
FIGURE 11.9
The consensus DNA binding motif of the transcription factor NRF2. (a) The currently known
NRF2-binding motif. (b) The result of a de novo motif analysis using NRF2 ChIP-Seq data.
(From BN Chorley, MR Campbell, X Wang, M Karaca, D Sambandan, F Bangura, P Xue, J Pi,
SR Kleeberger, DA Bell, Identification of novel NRF2-regulated genes by ChIP-Seq: Influence
on retinoid X receptor alpha, Nucleic Acids Research 2012, 40(15):7416–7429. With permission.)
a genome and the host cell. As a good example, such an integrated analysis
has led to the discovery of a large number of chromatin states, each of which
display distinct sequence motifs and functional characteristics [333]. The dis-
covery of these chromatin states was achieved with the use of a multivariate
hidden Markov model on a large collection of ChIP-Seq data, generated for
38 different histone methylation and acetylation marks, H2AZ (a variant of
histone H2A), RNA polymerase II, and CTCF (a transcriptional repressor).
Besides meta-analysis of multiple ChIP-Seq data sets, integrated analysis of
ChIP-Seq with other genomics data, such as RNA-Seq data, offers further
information on genome function and regulation. The majority of protein fac-
tors used in various ChIP-Seq studies are transcription factors and histones
that carry a large array of modified marks, all of which are key regulators of
genome transcription. Coupled analysis of matched ChIP-Seq and RNA-Seq
data augments the utility of both data types, and provides new insights that
cannot be obtained from analyzing either data type alone. To carry out inte-
grative analysis of ChIP-Seq and RNA-Seq data, Bayesian mixed models [334]
can be applied. In addition, tools such as CEAS [314] and ChIPpeakAnno
[315] can also be used to help investigate the correlation between the DNA
binding profile and regulation of nearby gene transcription.
12
Epigenomics and DNA Methylation Analysis
by Next-Generation Sequencing (NGS)
163
164 Next-Generation Sequencing Data Analysis
(1) Denaturation
FIGURE 12.1
Major steps of bisulfite sequencing. Prior to bisulfite treatment, the two strands of DNA are
first separated by denaturation. The bisulfite treatment then converts unmethylated, but not
methylated, cytosines to uracils. The two strands from the treatment, BSW and BSC, are then
subjected to PCR amplification. This leads to the generation of four strands (BSW, BSWR, BSC,
and BSCR), all of which are distinct from the original Watson and Crick strands. (From Y Xi,
W Li, BSMAP: Whole genome bisulfite sequence MAPping program, BMC Bioinformatics 2009,
10:232. Used under the terms of the Creative Commons Attribution License, http://creative
commons.org/licenses/by/2.0, ©2009 Xi and Li.)
Epigenomics and DNA Methylation Analysis by NGS 165
genome and DNA methylation site identification. The general data QC steps
detailed in Chapter 5 should be performed for their removal. Other QC steps
include adapter trimming, as some sequencing reactions may run through
DNA inserts into adapters. In addition, for MspI-digested RRBS libraries, the
DNA fragment end repair step during the library construction artificially
introduces two bases (an unmethylated cytosine and a guanine) to both
ends, both of which should be trimmed off as well. Tools such as Trim Galore
(a wrapper tool using Cutadapt and FastQC) [348] can be used for these
trimming steps, especially removing the two artificially introduced bases
in RRBS reads derived from MspI digestion. Besides these general-purpose
QC tools, some packages designed for bisulfite sequencing reads processing,
including BSmooth [349] and WBSA [350], also contain QC modules.
12.2.2 Read Mapping
In order to identify methylated DNA sites, sequencing reads derived from
bisulfite conversion or methylated DNA enrichment need to be first mapped
to the reference genome. Mapping of reads generated from the enrichment-
based methods is rather straightforward, and like mapping ChIP-Seq reads,
is usually conducted with general aligners, such as Bowtie, BWA, or SOAP.
Mapping of bisulfite sequencing reads, however, is less straightforward. This
is because through the bisulfite conversion and the subsequent sequencing
process, a converted unmethylated cytosine is read as a thymine (T), or an
adenine (A) on the complementary strand, whereas a methylated cytosine
remains as a cytosine (C), or a guanine (G) on the complementary strand (see
Figure 12.1). The conversion therefore has several implications for the read
mapping process:
There are two general strategies for mapping bisulfite sequencing reads:
(1) replacing all C’s in the reference genome with the wild-card letter Y to
168 Next-Generation Sequencing Data Analysis
match both C’s and T’s in the reads; and (2) converting all C’s in the refer-
ence sequence and reads to T’s, and then aligning with a seed-and-extend
approach. Aligners that use the wild-card approach include BSMAP [351],
Pash [352], and RRBSMAP (a version of BSMAP specifically tailored for RRBS
reads) [353]. In the example of BSMAP, it uses SOAP for carrying out read
alignment, and deploys genome hashing and bitwise masking for speed and
accuracy. BSMAP indexes the reference genome using a hash table contain-
ing original reference seed sequences and all their possible bisulfite conver-
sion variants through the replacement of C’s with T’s. After determining the
potential genomic position of each read by looking up the hash table, for the
T’s in each bisulfite read that are mapped to reference genome position(s)
where the original reference bases are C’s, BSMAP masks as C’s. The masked
bisulfite reads are then mapped again to the reference genome.
Aligners such as BatMeth [354], Bismark [355], BRAT-BW [356], BS-Seeker/
BS-Seeker2 [357,358], and MethylCoder [359] use the other three-letter
approach. Among these aligners, Bismark is commonly used. To carry out
its alignment (illustrated in Figure 12.2), Bismark first converts C’s in the
reads into T’s, and G’s into A’s (equivalent of the C-to-T conversion on the
Genomic fragment
Sequence after bisulfite
treatment
Read conversion
(1) (2)
Align to bisulfite
converted genomes
(3) (4)
Forward strand C-to-T converted genome Forward strand G-to-A converted genome
FIGURE 12.2
The “three-letter” bisulfite sequencing read alignment approach used by Bismark. (Adapted
from F Krueger, SR Andrews, Bismark: A flexible aligner and methylation caller for Bisulfite-
Seq applications, Bioinformatics 2011, 27:1571–1572. With permission.)
Epigenomics and DNA Methylation Analysis by NGS 169
bisulfite conversion from genetic variants. The use of sequence reads from
the complementary strand makes this possible, because a T produced from
bisulfite conversion will have a G on the opposite strand, whereas a C → T
SNP will have an A on the other strand.
Different from the bisulfite-conversion-based sequencing methods, the
methylated DNA enrichment sequencing approaches such as MeDIP-Seq
and MBD-Seq cannot quantify methylation at the single-nucleotide resolu-
tion. In addition, the absolute levels of DNA methylation cannot be obtained
from the enrichment-based methods, as the sequence read counts from
these methods are a function of both absolute DNA methylation levels and
regional CpG content. Since these approaches are based on affinity immuno-
precipitation and more similar to ChIP-Seq, analytical methods developed
for ChIP-Seq data analysis, including background determination, normal-
ization, and peak detection, can be applied for quantification of DNA meth-
ylation by these approaches. As an output, the degree of DNA methylation
can be summarized as coverage over a predefined region, such as per gene,
promoter, or certain-sized bin.
track type=bedGraph
chr19 45408804 45408805 1.0
chr19 45408806 45408807 0.75
chr19 45408854 45408855 0.3
chr19 45408855 45408856 0.5
FIGURE 12.3
An example of the bedGraph file format. It includes a track definition line (the first line), fol-
lowed by track data lines in a four-column format (i.e., chromosome, chromosome start posi-
tion, chromosome end position, and data value).
1 60
0 0
1 60
Forward 1
H1 ESC
methylC-Seq 0
Max read depth: 30
Reverse 1
HOXA1 HOXA3 HOXA4 HOXA5 HOXA7 HOXA10 HOXA13
HOXA2 HOXA6 HOXA9 HOXA11
chr7 p21.3 p21.1 p15.3 p14.1 p13 q11.22q11.23 q21.11 q21.3 q22.1 q31.1 q33 q34 q35 q36.1
Epigenomics and DNA Methylation Analysis by NGS
FIGURE 12.4
Visualization of DNA methylation data in a genome browser. Shown here is the methylC track in the Washington University EpiGenome Browser
for a region of the human chromosome 7 where the HOXA gene cluster is located. The original WGBS data was collected from H1 human embryonic
stem cells. Both DNA methylation levels (represented by vertical bars) and read depth (the smoothed curve) are displayed in a strand-specific fashion.
A close-up view of the boxed region is shown on the top, with the left axis marking the DNA methylation level and the right marking read depth.
(Modified from X Zhou, D Li, RF Lowdon, JF Costello, T Wang, methylC Track: Visual integration of single-base resolution DNA methylation data on
the WashU EpiGenome Browser, Bioinformatics 2014, 30:2206–2207. With permission.)
171
172 Next-Generation Sequencing Data Analysis
175
176 Next-Generation Sequencing Data Analysis
Compared to the NGS data generated from a single species (most chap-
ters in this book deal with individual species), the metagenomics data from
microbial community sequencing is much more complicated. Each metage-
nome contains DNA sequences from a large but unknown number of spe-
cies, including viruses, bacteria, archaea, fungi, and microscopic eukaryotes.
To further complicate the situation, the relative abundance of these species
varies widely. In comparison to sequencing reads collected from a single
species, metagenomic sequencing reads contain much higher heterogeneity
because of the tremendous genome diversity in each microbial community.
Also because of the tremendous DNA sequence complexity contained in the
metagenome, most metagenomic sequencing efforts can only sample part of
the DNA pool. As a result of this limited sampling in a highly diverse DNA
space, metagenomic NGS data is highly fragmented and has low redundancy.
Due to the lack of redundant (i.e., partially overlapping, not duplicate) reads,
metagenomic NGS data has an inherently higher error rate when compare to
single-genome sequencing. All these differences between metagenomic and
monogenomic NGS data require an entirely different set of tools for NGS-
based metagenome data analysis for microbial community structural and
functional profiling.
13.2 Sequencing Approaches
There are several key factors that need to be considered before the sequenc-
ing process starts. These include sequencing depth, read length, and sequenc-
ing platforms. The depth of sequencing is dependent on the goal to be
pursued. Studies that attempt to locate rare members of microbial commu-
nities require deeper sequencing than those that are only focused on more
abundant members. With regard to read length, longer reads are always bet-
ter than shorter reads in metagenomics for sorting out the inherent sequence
complexity. The read length from the commonly used Illumina HiSeq sys-
tem can currently read 125 bp from one end using the high-output mode,
and can generate ~450 bp reads if using partially overlapped paired-end
sequencing during the rapid run mode (see Chapter 10). As overviewed in
Chapter 4, other technologies, such as Pacific Biosciences’s SMRT (single
molecule real-time) and the 454 pyrosequencing, generate longer (but fewer)
reads. A hybrid approach is often used to take advantage of the different
Metagenome Analysis by Next-Generation Sequencing (NGS) 179
Sample processing
and sequencing
Reads binning
Metagenome assembly
Reads mapping to currently
known gene sequences
Contig binning
Calling of ORFs
and genomic elements
Phylogenetic analysis
Metabolic pathway
reconstruction
Comparative metagenomic
analysis
FIGURE 13.1
Major steps of metagenome analysis.
Metagenome Analysis by Next-Generation Sequencing (NGS) 181
Genovo [382], and Xgenovo [383] can be used. For the relatively short Illumina
reads, more assemblers are currently available, including MetaVelvet/
MetaVelvet-SL [384,385], meta-IDBA/IDBA-UD [386,387], GeneStitch [388],
Ray Meta [389], and Omega [390]. Similar to single-genome assemblers, many
of these short-read metagenome assemblers are based on the de Bruijn graph
approach (see Chapter 10). The difference from the single-genome assem-
blers, though, is that they attempt to identify subgraphs within a mixed de
Bruijn graph, each of which is expected to represent an individual genome.
For example, MetaVelvet first builds a large mixed de Bruijn graph from meta
genomic reads, which is then decomposed into individual subgraphs.
After the assembling process, a metagenome usually comprises mostly
small contigs. To evaluate the assembly quality, traditional evaluation met-
rics, such as N50, are not as informative and representative as in evaluat-
ing single-genome assemblies. Instead, aggregate statistics such as the total
number of contigs, and the maximum, median, and average length of the
contigs are often used. Further inspection of the assembly quality includes
looking for chimeric assemblies. While there are currently no tools available
to detect chimeric assembly, the assemblies should nevertheless be checked
by looking for signs of chimeric assembly, such as sudden changes in cover-
age, G/C content, and codon usage (different species have different codon
usage patterns). The use of paired-end reads and a higher sequence match
threshold helps reduce the rate of chimeric assemblies.
After contig assembly, if paired reads are available, metagenome scaffolds
can be built from the contigs. Many of the metagenome assemblers have a
module to carry out scaffolding. Besides these modules, dedicated metage-
nome scaffolding tools like Bambus 2 [277] may be used to determine if
additional scaffolding is needed. Bambus 2 accepts contigs constructed with
most assemblers using reads from all sequencing platforms. In the process
of building scaffolds from contigs, ambiguous and inconsistent contigs may
also be identified.
13.5.2 Sequence Binning
Metagenomic sequence binning refers to the process of grouping sequence
fragments in the mixture and placing them into different “bins” correspond-
ing to their taxonomic origins. This process can be conducted on both assem-
bled and unassembled reads. With longer reads or contigs, high-resolution
binning can be achieved at the levels of family or genus. Short sequences may
be binned only to the level of phylum due to the limited information carried
in the sequences. Since it reduces the complexity inherent in the metagenom-
ics data, each set of binned sequences can also be subjected to independent
analysis in other steps. For example, assembly can be performed postbinning
on each binned sequence set to improve performance.
Three binning approaches are usually used: those based on sequence com-
position, homology, and fragment recruitment. Composition-based binning
Metagenome Analysis by Next-Generation Sequencing (NGS) 183
is not very well suited to study microbial communities that contain many
unknown species.
dividing the raw count of reads assigned to a certain species or OTU by the
total number of reads in the same sample. Another approach is cumulative-
sum scaling (CSS), which, similar to the upper quartile approach in RNA-
Seq, is calculated by dividing the raw count of reads assigned to a species
or OTU by the cumulative sum of counts up to a certain percentile. In a cur-
rently available study [429], CSS performed better than other normalization
approaches including TSS.
191
192 Next-Generation Sequencing Data Analysis
FIGURE 14.1
Third-generation single DNA/RNA molecule sequencing by measuring tunneling currents
generated from the passing of a DNA/RNA molecule through a pair of nanoelectrodes with
a subnanometer gap. (From T Ohshiro, K Matsubara, M Tsutsui, M Furuhashi, M Taniguchi,
T Kawai, Single-molecule electrical random resequencing of DNA and RNA, Scientific Reports
2012, 2:501. With permission.)
• Single DNA (or RNA) molecule sequencing, that is, the ability to
directly read individual target DNA molecules without relying on
polymerase chain reaction (PCR) amplification or conversion to
cDNA in the case of RNA
• Much improved read length
What Is Next for Next-Generation Sequencing (NGS)? 193
The increased sensitivity that leads to the achievement of single DNA (or
RNA) molecule sequencing makes it possible to directly sequence the genome
or transcriptome of a single cell without any amplification. The reduced
equipment size and increased affordability makes high-throughput sequenc-
ing more accessible to individual research and clinical labs, instead of being
mostly limited to large genome centers or core facilities. The change in read
length, and other aspects of sequence read output such as error model, also
drives further evolution of bioinformatic tools.
FIGURE 14.2
The increase in the number of publications from 2010 to 2014 related to the development and
application of RNA-Seq algorithms. (From Google Scholar.)
What Is Next for Next-Generation Sequencing (NGS)? 195
14.4 Parallel Computing
Parallelization, a computation term that describes splitting of a task into a
number of independent subtasks, can significantly increase the processing
speed of highly parallelizable tasks, which include many NGS data analysis
steps. For example, although millions of reads are generated from a sequenc-
ing run, mapping of these reads to a reference genome is a process that is
“embarrassingly parallel,” as each read is mapped independently to the
reference. As parallel computing can be efficiently carried out by graphics
processing units (GPUs) since rendering of each pixel on a computer screen
is also a highly parallel process, the integration of GPUs with CPUs in het-
erogeneous computing systems can increase throughput ten- to hundred-
fold, and turn individual computers into mini-supercomputers. While these
systems can be applied to various aspects of NGS data analysis, many NGS
analytical tools have yet to take full advantage of the power of parallel com-
puting in such systems.
Parallelization is also an important factor in determining how an increase
in the number of CPU (or GPU) cores might affect actual NGS data pro-
cessing performance. If a step is highly parallelizable, and the algorithm
designed for it employs parallelization, then an increase in core number will
most likely lead to improved performance. On the contrary, if the step is not
196 Next-Generation Sequencing Data Analysis
14.5 Cloud Computing
Because the rates at which NGS technologies advance and sequencing costs
drop are faster than those of development in the computer hardware indus-
try and the resultant increase in computing power (i.e., Moore’s law), the gap
between NGS data generation and their computational analysis will only
widen. To narrow this gap and speed up NGS data processing, the NGS com-
munity has begun to embrace a trend that has been taking place in comput-
ing resource distribution from the long-existing model of local computing to
cloud computing. Companies such as Amazon, Microsoft, and Google have
been building megascale cloud computing clusters and data storage systems
for end users to use over the Internet. Compared to local computing, cloud
computing enables access to supercomputing and mass data storage capa-
bilities without the need to build and maintain a local workstation, server, or
high-performance computing cluster.
At the core of cloud computing is virtualization technology, which allows
an end user to create a virtual computer system on demand with the flex-
ibility of specifying the number of CPU cores, memory size, disk space, and
operating system that are required for a job. With this technology, multiple
virtual computer systems can be run simultaneously on the same physical
cloud server. The adoption of cloud computing for NGS data processing has
demonstrated the advantages of this “supercomputing-on-demand” model,
which include flexibility, scalability, and oftentimes cost-savings. The flexibil-
ity and scalability offered by cloud computing allow a researcher to conduct
NGS data analysis using supercomputing capabilities that previously only
existed in large genome centers. Cost savings are achieved as the user only
needs to pay for the time used by the user-configured computing instance.
Another advantage of using the cloud is with data sharing among research-
ers and projects. By providing single, centralized data storage, the cloud
enables different groups located in different geographical locations to have
access to the same data sets and share analytical results. Furthermore, with
cloud computing, the task of bringing software tools to the “big” NGS data
can be more readily realized. In contrast to the large sizes of NGS data
files, the software and scripts designed to process them are much smaller.
Therefore, it is much easier and more efficient to download and install them
to wherever the data is stored, rather than moving or replicating the high
volumes of NGS data to where the tools are installed. By directly storing
production data in the cloud, the burden of data transfer is greatly reduced;
What Is Next for Next-Generation Sequencing (NGS)? 197
by coupling data and tools in the same place, optimal performance can be
achieved.
While cloud computing enables users to offload the hassle and cost of run-
ning and maintaining a local computing system, it does have downsides that
need to be considered. One of the practical barriers of moving to the cloud
is the speed of data transfer into and out of the cloud. It may take a week to
upload 100 GB of data to the cloud using low-speed Internet connections.
The question of whether to run analysis in the cloud is heavily dependent
on the amount of data to be transferred and the computational complexity
of the analytical steps. As a general rule, it is only worthwhile to upload
data to the cloud for processing when the analytical task requires more than
105 CPU cycles per byte of data [441]. So for projects that deal with large
amounts of data but do not involve a lot of highly intensive computational
steps, more time may be spent on data transfer to the cloud rather than data
processing. Other potential factors include data security, cost ineffectiveness
under some circumstances, availability of analytical tools in the cloud envi-
ronment, and network downtime. Although users can access their data from
anywhere on the Internet, the convenience also means the possibility of data
security being breached or compromised. Some heavy users may find cloud
computing not as cost-effective as running a local server. While more tools
are becoming available in the cloud, users still need to use due diligence
to make sure that the tools they need are available. For users at places that
suffer frequent network outages, cloud computing can be problematic as all
cloud-based operations are dependent on Internet traffic.
Despite the potential downsides, cloud computing has been proven to
be a viable approach for NGS data analysis. Table 14.1 is a list of some of
the current cloud-computing providers that can be used for NGS applica-
tions. To illustrate how cloud computing can be deployed for analyzing NGS
data, following is an example on the conduct of reads alignment using the
Amazon Elastic Compute Cloud (Amazon EC2). As the first step, input data
files (FASTQ files and a reference genome file) are uploaded from a local
computer to a “bucket” in the Amazon S3 cloud storage. This bucket, which
is also used to hold program scripts and output files, can be created with the
TABLE 14.1
Providers of Cloud Computing That Can Be Used for NGS Data Analysis
Provider URL
Amazon Elastic Compute Cloud http://aws.amazon.com/ec2/
Rackspace http://www.rackspace.com
Bionimbus http://bionimbus.opensciencedatacloud.org
Open Cloud Consortium (not-for-profit) http://opencloudconsortium.org
Microsoft Azure http://azure.microsoft.com/
Google Cloud https://cloud.google.com
198 Next-Generation Sequencing Data Analysis
BAM: A file format for storing reads alignment data. It is the binary version
of the SAM format (see SAM). Compared to its equivalent SAM file,
a BAM file is considerably smaller in size and much faster to load.
Unlike SAM files, however, the BAM format is not human-readable.
BAM files have a file extension of .bam. Some tools require BAM
files to be indexed. Besides the .bam file, an indexed BAM file also
has a companion index file of the same name but with a different file
extension (.bai).
BCF: Binary VCF (see VCF). While it is equivalent to VCF, BCF is much
smaller in file size due to compression, and therefore achieves high
efficiency in file transfer and parsing.
BCL: Binary base call files generated from Illumina’s proprietary base call-
ing process.
BED: Browser Extensible Display format used to describe genes or other
genomic features in a genome browser. It is a tab-delimited text for-
mat that defines how genes or genomic features are displayed as an
annotation track in a genome browser such as the UCSC Genome
Browser. Each entry line contains three mandatory fields (chrom,
chromStart, and chromEnd, specifying for each genomic feature the
particular chromosome it is located on and the start and end coor-
dinates) and nine optional fields. Binary PED files (see PED) are also
referred to as BED files, but this is a totally different file format.
bedGraph: Similar to the BED format, bedGraph provides descriptions of
genomic features for their display in a genome browser. Distinctively
the bedGraph format allows display of continuous values, such as
probability scores or coverage depth, in a genome.
bigBed: A format similar to BED, but bigBed files are binary, compressed,
and indexed. Display of bigBed files in a genome browser is sig-
nificantly faster due to the compression and indexing, which allow
transmittal of only the part of the file that is needed for the current
view instead of the entire file.
bigWig: A format for visualization of dense, continuous data, such as GC
content, in a genome browser. A newer format than the WIG format
199
200 Appendix A
WIG: Wiggle Track Format. It is used for displaying continuous data tracks,
such as GC content, in a genome viewer such as the UCSC Genome
Browser. The WIG format is similar to the bedGraph format (see bed-
Graph), but a major difference between the two is that data exported
from a WIG track is not as well preserved as that from a bedGraph
track. The WIG format can be converted to bigWig (see bigWig) for
improved performance.
Appendix B:
Glossary
203
204 Appendix B
1. Vale RD. The molecular motor toolbox for intracellular transport. Cell 2003,
112:467–480.
2. Cavalier-Smith T. The simultaneous symbiotic origin of mitochondria, chloro-
plasts, and microbodies. Ann NY Acad Sci 1987, 503:55–71.
3. Lopez de Heredia M, Jansen RP. mRNA localization and the cytoskeleton. Curr
Opin Cell Biol 2004, 16:80–85.
4. Hirokawa N. mRNA transport in dendrites: RNA granules, motors, and tracks.
J Neurosci 2006, 26:7139–7142.
5. Mayer F. Cytoskeletons in prokaryotes. Cell Biol Int 2003, 27:429–438.
6. Bender A, Krishnan KJ, Morris CM, Taylor GA, Reeve AK, Perry RH, Jaros E
et al. High levels of mitochondrial DNA deletions in substantia nigra neurons
in aging and Parkinson disease. Nat Genet 2006, 38:515–517.
7. Corral-Debrinski M, Shoffner JM, Lott MT, Wallace DC. Association of mito-
chondrial DNA damage with aging and coronary atherosclerotic heart disease.
Mutat Res 1992, 275:169–180.
8. Santos RX, Correia SC, Zhu X, Smith MA, Moreira PI, Castellani RJ, Nunomura
A, Perry G. Mitochondrial DNA oxidative damage and repair in aging and
Alzheimer’s disease. Antioxid Redox Signal 2013, 18:2444–2457.
9. Greaves LC, Reeve AK, Taylor RW, Turnbull DM. Mitochondrial DNA and dis-
ease. J Pathol 2012, 226:274–286.
10. Green BR. Chloroplast genomes of photosynthetic eukaryotes. Plant J 2011,
66:34–44.
11. Harris SA, Ingram R. Chloroplast DNA and biosystematics: The effects of intra-
specific diversity and plastid transmission. Taxon 1991:393–412.
12. Roy U, Grewal RK, Roy S. Complex networks and systems biology. In Systems
and Synthetic Biology. Springer; 2015:129–150.
13. Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD,
Bult CJ et al. The minimal gene complement of Mycoplasma genitalium. Science
1995, 270:397–403.
14. Glass JI, Assad-Garcia N, Alperovich N, Yooseph S, Lewis MR, Maruf M,
Hutchison CA, 3rd, Smith HO, Venter JC. Essential genes of a minimal bacte-
rium. Proc Natl Acad Sci USA 2006, 103:425–430.
15. Nakabachi A, Yamashita A, Toh H, Ishikawa H, Dunbar HE, Moran NA, Hattori
M. The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science
2006, 314:267.
16. Colbourne JK, Pfrender ME, Gilbert D, Thomas WK, Tucker A, Oakley TH,
Tokishita S et al. The ecoresponsive genome of Daphnia pulex. Science 2011,
331:555–561.
17. Shapiro JA, von Sternberg R. Why repetitive DNA is essential to genome func-
tion. Biol Rev Camb Philos Soc 2005, 80:227–250.
18. Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L et al.
Analysis of genetic inheritance in a family quartet by whole-genome sequenc-
ing. Science 2010, 328:636–639.
213
214 References
19. Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and geno-
typing. Nat Rev Genet 2011, 12:363–376.
20. Malnic B, Godfrey PA, Buck LB. The human olfactory receptor gene family. Proc
Natl Acad Sci USA 2004, 101:2584–2589.
21. Inai Y, Ohta Y, Nishikimi M. The whole structure of the human nonfunctional
L-gulono-gamma-lactone oxidase gene—the gene responsible for scurvy—and
the evolution of repetitive sequences thereon. J Nutr Sci Vitaminol (Tokyo) 2003,
49:315–319.
22. Law JA, Jacobsen SE. Establishing, maintaining and modifying DNA methyla-
tion patterns in plants and animals. Nat Rev Genet 2010, 11:204–220.
23. Cedar H, Bergman Y. Linking DNA methylation and histone modification:
Patterns and paradigms. Nat Rev Genet 2009, 10:295–304.
24. Guo W, Chung WY, Qian M, Pellegrini M, Zhang MQ. Characterizing the
strand-specific distribution of non-CpG methylation in human pluripotent
cells. Nucleic Acids Res 2014, 42:3009–3016.
25. Wu H, Zhang Y. Reversing DNA methylation: Mechanisms, genomics, and bio-
logical functions. Cell 2014, 156:45–68.
26. Bertram L, McQueen MB, Mullin K, Blacker D, Tanzi RE. Systematic meta-
analyses of Alzheimer disease genetic association studies: The AlzGene data-
base. Nat Genet 2007, 39:17–23.
27. Baylin SB, Jones PA. A decade of exploring the cancer epigenome—Biological
and translational implications. Nat Rev Cancer 2011, 11:726–734.
28. Ehrlich M. DNA hypomethylation in cancer cells. Epigenomics 2009, 1:239–259.
29. Serganov A, Nudler E. A decade of riboswitches. Cell 2013, 152:17–24.
30. Ray PS, Jia J, Yao P, Majumder M, Hatzoglou M, Fox PL. A stress-responsive
RNA switch regulates VEGFA expression. Nature 2009, 457:915–919.
31. Wan Y, Kertesz M, Spitale RC, Segal E, Chang HY. Understanding the tran-
scriptome through RNA structure. Nat Rev Genet 2011, 12:641–655.
32. Imashimizu M, Oshima T, Lubkowska L, Kashlev M. Direct assessment of tran-
scription fidelity by high-resolution RNA sequencing. Nucleic Acids Res 2013,
41:9090–9104.
33. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF,
Schroth GP, Burge CB. Alternative isoform regulation in human tissue tran-
scriptomes. Nature 2008, 456:470–476.
34. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splic-
ing complexity in the human transcriptome by high-throughput sequencing.
Nat Genet 2008, 40:1413–1415.
35. Keegan LP, Gallo A, O’Connell MA. The many roles of an RNA editor. Nat Rev
Genet 2001, 2:869–878.
36. Bratt E, Ohman M. Coordination of editing and splicing of glutamate receptor
pre-mRNA. RNA 2003, 9:309–318.
37. Pfeiffer BE, Huber KM. Current advances in local protein synthesis and synap-
tic plasticity. J Neurosci 2006, 26:7147–7150.
38. Rustad TR, Minch KJ, Brabant W, Winkler JK, Reiss DJ, Baliga NS, Sherman
DR. Global analysis of mRNA stability in Mycobacterium tuberculosis. Nucleic
Acids Res 2013, 41:509–517.
39. Sharova LV, Sharov AA, Nedorezov T, Piao Y, Shaik N, Ko MS. Database for
mRNA half-life of 19 977 genes obtained by DNA microarray analysis of pluri
potent and differentiating mouse embryonic stem cells. DNA Res 2009, 16:45–58.
References 215
102. Zhulidov PA, Bogdanova EA, Shcheglov AS, Vagner LL, Khaspekov GL,
Kozhemyako VB, Matz MV et al. Simple cDNA normalization using kamchatka
crab duplex-specific nuclease. Nucleic Acids Res 2004, 32:e37.
103. Gallego Romero I, Pai AA, Tung J, Gilad Y. RNA-seq: Impact of RNA degrada-
tion on transcript quantification. BMC Biol 2014, 12:42.
104. Yang L, Duff MO, Graveley BR, Carmichael GG, Chen LL. Genomewide charac-
terization of non-polyadenylated RNAs. Genome Biol 2011, 12:R16.
105. Auer PL, Doerge RW. Statistical design and analysis of RNA sequencing data.
Genetics 2010, 185:405–416.
106. Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT. Scotty: A web tool
for designing RNA-Seq experiments to measure differential gene expression.
Bioinformatics 2013, 29:656–657.
107. Wang Y, Ghaffari N, Johnson CD, Braga-Neto UM, Wang H, Chen R, Zhou H.
Evaluation of the coverage and depth of transcriptome by RNA-Seq in chick-
ens. BMC Bioinformatics 2011, 12 Suppl 10:S5.
108. Vijay N, Poelstra JW, Kunstner A, Wolf JB. Challenges and strategies in tran-
scriptome assembly and differential gene expression quantification. A compre-
hensive in silico assessment of RNA-seq experiments. Mol Ecol 2013, 22:620–634.
109. Liu Y, Ferguson JF, Xue C, Silverman IM, Gregory B, Reilly MP, Li M. Evaluating
the impact of sequencing depth on transcriptome profiling in human adipose.
PLoS One 2013, 8:e66883.
110. Ching T, Huang S, Garmire LX. Power analysis and sample size estimation for
RNA-Seq differential expression. RNA 2014, 20:1684–1696.
111. DeLuca DS, Levin JZ, Sivachenko A, Fennell T, Nazaire MD, Williams C, Reich
M, Winckler W, Getz G. RNA-SeQC: RNA-seq metrics for quality control and
process optimization. Bioinformatics 2012, 28:1530–1532.
112. Wang L, Wang S, Li W. RSeQC: Quality control of RNA-seq experiments.
Bioinformatics 2012, 28:2184–2185.
113. Tang S, Riva A. PASTA: Splice junction identification from RNA-sequencing
data. BMC Bioinformatics 2013, 14:116.
114. Chen LY, Wei KC, Huang AC, Wang K, Huang CY, Yi D, Tang CY, Galas DJ,
Hood LE. RNASEQR—A streamlined and accurate RNA-seq sequence analysis
program. Nucleic Acids Res 2012, 40:e42.
115. Grant GR, Farkas MH, Pizarro AD, Lahens NF, Schug J, Brunk BP, Stoeckert
CJ, Hogenesch JB, Pierce EA. Comparative analysis of RNA-Seq alignment
algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics 2011,
27:2518–2528.
116. Xu G, Deng N, Zhao Z, Judeh T, Flemington E, Zhu D. SAMMate: A GUI tool for
processing short read alignments in SAM/BAM format. Source Code Biol Med
2011, 6:2.
117. Ryan MC, Cleland J, Kim R, Wong WC, Weinstein JN. SpliceSeq: A resource
for analysis and visualization of RNA-Seq data on alternative splicing and its
functional impacts. Bioinformatics 2012, 28:2385–2387.
118. Trapnell C, Pachter L, Salzberg SL. TopHat: Discovering splice junctions with
RNA-Seq. Bioinformatics 2009, 25:1105–1111.
119. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2:
Accurate alignment of transcriptomes in the presence of insertions, deletions
and gene fusions. Genome Biol 2013, 14:R36.
References 219
120. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X et al.
MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery.
Nucleic Acids Res 2010, 38:e178.
121. Au KF, Jiang H, Lin L, Xing Y, Wong WH. Detection of splice junctions from
paired-end RNA-seq data by SpliceMap. Nucleic Acids Res 2010, 38:4570–4578.
122. Dimon MT, Sorber K, DeRisi JL. HMMSplicer: A tool for efficient and sensitive
discovery of known and novel splice junctions in RNA-Seq data. PLoS One 2010,
5:e13875.
123. Marco-Sola S, Sammeth M, Guigo R, Ribeca P. The GEM mapper: Fast, accurate
and versatile alignment by filtration. Nat Methods 2012, 9:1185–1188.
124. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splic-
ing in short reads. Bioinformatics 2010, 26:873–881.
125. Bao H, Xiong Y, Guo H, Zhou R, Lu X, Yang Z, Zhong Y, Shi S. MapNext: A
software tool for spliced and unspliced alignments and SNP detection of short
sequence reads. BMC Genomics 2009, 10 Suppl 3:S13.
126. Ameur A, Wetterbom A, Feuk L, Gyllensten U. Global and unbiased detection
of splice junctions from RNA-seq data. Genome Biol 2010, 11:R34.
127. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson
M, Gingeras TR. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics
2013, 29:15–21.
128. Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: Robust de novo RNA-seq
assembly across the dynamic range of expression levels. Bioinformatics 2012,
28:1086–1092.
129. Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, Huang W et al. SOAPdenovo-Trans:
De novo transcriptome assembly with short RNA-Seq reads. Bioinformatics
2014, 30:1660–1666.
130. Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K et al.
De novo assembly and analysis of RNA-seq data. Nat Methods 2010, 7:909–912.
131. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X
et al. Full-length transcriptome assembly from RNA-Seq data without a refer-
ence genome. Nat Biotechnol 2011, 29:644–652.
132. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods
for normalization and differential expression in mRNA-Seq experiments. BMC
Bioinformatics 2010, 11:94.
133. Law CW, Chen Y, Shi W, Smyth GK. Voom: Precision weights unlock linear
model analysis tools for RNA-seq read counts. Genome Biol 2014, 15:R29.
134. Risso D, Schwartz K, Sherlock G, Dudoit S. GC-content normalization for RNA-
Seq data. BMC Bioinformatics 2011, 12:480.
135. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: An assess-
ment of technical reproducibility and comparison with gene expression arrays.
Genome Res 2008, 18:1509–1517.
136. Anders S, Huber W. Differential expression analysis for sequence count data.
Genome Biol 2010, 11:R106.
137. Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and
false discovery rate estimation for RNA-sequencing data. Biostatistics 2012,
13:523–538.
138. Hardcastle TJ, Kelly KA. baySeq: Empirical Bayesian methods for identifying
differential expression in sequence count data. BMC Bioinformatics 2010, 11:422.
220 References
139. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ,
Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by
RNA-Seq reveals unannotated transcripts and isoform switching during cell
differentiation. Nat Biotechnol 2010, 28:511–515.
140. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L.
Differential analysis of gene regulation at transcript resolution with RNA-seq.
Nat Biotechnol 2013, 31:46–53.
141. Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: An R package for iden-
tifying differentially expressed genes from RNA-seq data. Bioinformatics 2010,
26:136–138.
142. Love MI, Huber W, Anders S. Moderated estimation of fold change and disper-
sion for RNA-Seq data with DESeq2. bioRxiv 2014.
143. Robinson MD, McCarthy DJ, Smyth GK. edgeR: A Bioconductor package for
differential expression analysis of digital gene expression data. Bioinformatics
2010, 26:139–140.
144. Li J, Tibshirani R. Finding consistent patterns: A nonparametric approach for
identifying differential expression in RNA-Seq data. Stat Methods Med Res 2013,
22:519–536.
145. Audic S, Claverie JM. The significance of digital gene expression profiles.
Genome Res 1997, 7:986–995.
146. Katz Y, Wang ET, Airoldi EM, Burge CB. Analysis and design of RNA sequencing
experiments for identifying isoform regulation. Nat Methods 2010, 7:1009–1015.
147. Wu J, Akerman M, Sun S, McCombie WR, Krainer AR, Zhang MQ. SpliceTrap:
A method to quantify alternative splicing under single cellular conditions.
Bioinformatics 2011, 27:3010–3016.
148. Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts
in annotated genomes using RNA-Seq. Bioinformatics 2011, 27:2325–2329.
149. Li JJ, Jiang CR, Brown JB, Huang H, Bickel PJ. Sparse linear modeling of next-
generation mRNA sequencing (RNA-Seq) data for isoform discovery and
abundance estimation. Proc Natl Acad Sci USA 2011, 108:19867–19872.
150. Mangul S, Caciula A, Glebova O, Mandoiu I, Zelikovsky A. Improved tran-
scriptome quantification and reconstruction from RNA-Seq reads using partial
annotations. In Silico Biol 2011, 11:251–261.
151. Griffith M, Griffith OL, Mwenifumbo J, Goya R, Morrissy AS, Morin RD, Corbett
R et al.: Alternative expression analysis by RNA sequencing. Nat Methods 2010,
7:843–847.
152. Singh D, Orellana CF, Hu Y, Jones CD, Liu Y, Chiang DY, Liu J, Prins JF. FDM: A
graph-based statistical method to detect differential transcription using RNA-
seq data. Bioinformatics 2011, 27:2633–2640.
153. Drewe P, Stegle O, Hartmann L, Kahles A, Bohnert R, Wachter A, Borgwardt K,
Ratsch G. Accurate detection of differential RNA processing. Nucleic Acids Res
2013, 41:5189–5198.
154. Shi Y, Jiang H. rSeqDiff: Detecting differential isoform expression from RNA-
Seq data using hierarchical likelihood ratio test. PLoS One 2013, 8:e79448.
155. Zheng S, Chen L. A hierarchical Bayesian model for comparing transcriptomes
at the individual transcript isoform level. Nucleic Acids Res 2009, 37:e75.
156. Glaus P, Honkela A, Rattray M. Identifying differentially expressed tran-
scripts from RNA-seq data with biological variation. Bioinformatics 2012, 28:
1721–1728.
References 221
157. Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD,
Gould MN, Stewart RM, Kendziorski C. EBSeq: An empirical Bayes hierarchical
model for inference in RNA-seq experiments. Bioinformatics 2013, 29:1035–1043.
158. Li B, Dewey CN. RSEM: Accurate transcript quantification from RNA-Seq data
with or without a reference genome. BMC Bioinformatics 2011, 12:323.
159. Nicolae M, Mangul S, Mandoiu, II, Zelikovsky A. Estimation of alternative
splicing isoform frequencies from RNA-Seq data. Algorithms Mol Biol 2011, 6:9.
160. Martin J, Bruno VM, Fang Z, Meng X, Blow M, Zhang T, Sherlock G, Snyder M,
Wang Z. Rnnotator: An automated de novo transcriptome assembly pipeline
from stranded RNA-Seq reads. BMC Genomics 2010, 11:663.
161. Sacomoto GA, Kielbassa J, Chikhi R, Uricaru R, Antoniou P, Sagot MF,
Peterlongo P, Lacroix V. KISSPLICE: De-novo calling alternative splicing events
from RNA-seq data. BMC Bioinformatics 2012, 13 Suppl 6:S5.
162. Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of
large gene lists using DAVID bioinformatics resources. Nat Protoc 2009, 4:44–57.
163. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA,
Paulovich A et al. Gene set enrichment analysis: A knowledge-based approach
for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005,
102:15545–15550.
164. Salzman J, Gawad C, Wang PL, Lacayo N, Brown PO. Circular RNAs are the
predominant transcript isoform from hundreds of human genes in diverse cell
types. PLoS One 2012, 7:e30733.
165. Davare MA, Tognon CE. Detecting and targeting oncogenic fusion proteins in
the genomic era. Biol Cell 2015, 107(5):111–129.
166. Huang V, Qin Y, Wang J, Wang X, Place RF, Lin G, Lue TF, Li LC. RNAa is con-
served in mammalian cells. PLoS One 2010, 5:e8848.
167. Alon S, Vigneault F, Eminaga S, Christodoulou DC, Seidman JG, Church GM,
Eisenberg E. Barcoding bias in high-throughput multiplex sequencing of
miRNA. Genome Res 2011, 21:1506–1511.
168. Hafner M, Renwick N, Brown M, Mihailovic A, Holoch D, Lin C, Pena JT et al.
RNA-ligase-dependent biases in miRNA representation in deep-sequenced
small RNA cDNA libraries. RNA 2011, 17:1697–1712.
169. Linsen SE, de Wit E, Janssens G, Heater S, Chapman L, Parkin RK, Fritz B et al.
Limitations and possibilities of small RNA digital gene expression profiling.
Nat Methods 2009, 6:474–476.
170. Tian G, Yin X, Luo H, Xu X, Bolund L, Zhang X, Gan SQ, Li N. Sequencing bias:
Comparison of different protocols of microRNA library construction. BMC
Biotechnol 2010, 10:64.
171. Metpally RP, Nasser S, Malenica I, Courtright A, Carlson E, Ghaffari L, Villa S,
Tembe W, Van Keuren-Jensen K. Comparison of analysis tools for miRNA high
throughput sequencing using nerve crush as a model. Front Genet 2013, 4:20.
172. Huang PJ, Liu YC, Lee CC, Lin WC, Gan RR, Lyu PC, Tang P. DSAP: Deep-
sequencing small RNA analysis pipeline. Nucleic Acids Res 2010, 38:W385–391.
173. Hackenberg M, Sturm M, Langenberger D, Falcon-Perez JM, Aransay AM.
miRanalyzer: A microRNA detection and analysis tool for next-generation
sequencing experiments. Nucleic Acids Res 2009, 37:W68–76.
174. Friedlander MR, Mackowiak SD, Li N, Chen W, Rajewsky N. miRDeep2 accu-
rately identifies known and hundreds of novel microRNA genes in seven ani-
mal clades. Nucleic Acids Res 2012, 40:37–52.
222 References
175. Wang WC, Lin FM, Chang WC, Lin KY, Huang HD, Lin NS. miRExpress:
Analyzing high-throughput sequencing data for profiling microRNA expres-
sion. BMC Bioinformatics 2009, 10:328.
176. Ronen R, Gan I, Modai S, Sukacheov A, Dror G, Halperin E, Shomron N.
miRNAkey: A software for microRNA deep sequencing analysis. Bioinformatics
2010, 26:2615–2616.
177. Zhu E, Zhao F, Xu G, Hou H, Zhou L, Li X, Sun Z, Wu J. mirTools: microRNA
profiling and discovery based on high-throughput sequencing. Nucleic Acids
Res 2010, 38:W392–397.
178. Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR,
Gardner PP, Bateman A. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res
2013, 41:D226–232.
179. Axtell MJ. Butter: High-precision genomic alignment of small RNA-seq data.
bioRxiv 2014:007427.
180. Morin RD, O’Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu AL,
Zhao Y et al. Application of massively parallel sequencing to microRNA profil-
ing and discovery in human embryonic stem cells. Genome Res 2008, 18:610–621.
181. Fernandez-Valverde SL, Taft RJ, Mattick JS. Dynamic isomiR regulation in
Drosophila development. RNA 2010, 16:1881–1888.
182. Li SC, Tsai KW, Pan HW, Jeng YM, Ho MR, Li WH. MicroRNA 3’ end nucleotide
modification patterns and arm selection preference in liver tissues. BMC Syst
Biol 2012, 6 Suppl 2:S14.
183. Pan CT, Tsai KW, Hung TM, Lin WC, Pan CY, Yu HR, Li SC. miRSeq: A user-
friendly standalone toolkit for sequencing quality evaluation and miRNA pro-
filing. Biomed Res Int 2014, 2014:462135.
184. Pantano L, Estivill X, Marti E. SeqBuster, a bioinformatic tool for the processing
and analysis of small RNAs datasets, reveals ubiquitous miRNA modifications
in human embryonic cells. Nucleic Acids Res 2010, 38:e34.
185. Garmire LX, Subramaniam S. Evaluation of normalization methods in mam-
malian microRNA-Seq data. RNA 2012, 18:1279–1288.
186. Dillies MA, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant
N, Keime C et al. A comprehensive evaluation of normalization methods for
Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 2013,
14:671–683.
187. Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS. MicroRNA targets in
Drosophila. Genome Biol 2003, 5:R1.
188. Betel D, Koppal A, Agius P, Sander C, Leslie C. Comprehensive modeling of
microRNA targets predicts functional non-conserved and non-canonical sites.
Genome Biol 2010, 11:R90.
189. Krek A, Grun D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, MacMenamin P
et al. Combinatorial microRNA target predictions. Nat Genet 2005, 37:495–500.
190. Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E. The role of site accessibility
in microRNA target recognition. Nat Genet 2007, 39:1278–1284.
191. Miranda KC, Huynh T, Tay Y, Ang YS, Tam WL, Thomson AM, Lim B, Rigoutsos
I. A pattern-based method for the identification of MicroRNA binding sites and
their corresponding heteroduplexes. Cell 2006, 126:1203–1217.
192. Rehmsmeier M, Steffen P, Hochsmann M, Giegerich R. Fast and effective pre-
diction of microRNA/target duplexes. RNA 2004, 10:1507–1517.
References 223
193. Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by aden-
osines, indicates that thousands of human genes are microRNA targets. Cell
2005, 120:15–20.
194. Reczko M, Maragkakis M, Alexiou P, Grosse I, Hatzigeorgiou AG. Functional
microRNA targets in protein coding sequences. Bioinformatics 2012, 28:771–776.
195. Maragkakis M, Alexiou P, Papadopoulos GL, Reczko M, Dalamagas T,
Giannopoulos G, Goumas G et al. Accurate microRNA target prediction cor-
relates with protein repression levels. BMC Bioinformatics 2009, 10:295.
196. Ritchie W, Flamant S, Rasko JE. Predicting microRNA targets and functions:
Traps for the unwary. Nat Methods 2009, 6:397–398.
197. Vlachos IS, Kostoulas N, Vergoulis T, Georgakilas G, Reczko M, Maragkakis M,
Paraskevopoulou MD, Prionidis K, Dalamagas T, Hatzigeorgiou AG. DIANA
miRPath v.2.0: Investigating the combinatorial effect of microRNAs in path-
ways. Nucleic Acids Res 2012, 40:W498–504.
198. Veltman JA, Brunner HG. De novo mutations in human genetic disease. Nat
Rev Genet 2012, 13(8):565–575.
199. Meyerson M, Gabriel S, Getz G. Advances in understanding cancer genomes
through second-generation sequencing. Nat Rev Genet 2010, 11(10):685–696.
200. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky
A, Garimella K et al. The Genome Analysis Toolkit: A MapReduce frame-
work for analyzing next-generation DNA sequencing data. Genome Res 2010,
20(9):1297–1303.
201. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis
G, Durbin R, Genome Project Data Processing Subgroup. The Sequence
Alignment/Map format and SAMtools. Bioinformatics 2009, 25(16):2078–2079.
202. Li R, Li Y, Fang X, Yang H, Wang J, Kristiansen K, Wang J. SNP detection for mas-
sively parallel whole-genome resequencing. Genome Res 2009, 19(6):1124–1132.
203. Challis D, Yu J, Evani US, Jackson AR, Paithankar S, Coarfa C, Milosavljevic
A, Gibbs RA, Yu F. An integrative variant analysis suite for whole exome next-
generation sequencing data. BMC Bioinformatics 2012, 13:8.
204. Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA,
Mardis ER, Ding L, Wilson RK. VarScan 2: Somatic mutation and copy num-
ber alteration discovery in cancer by exome sequencing. Genome Res 2012,
22(3):568–576.
205. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C,
Gabriel S, Meyerson M, Lander ES, Getz G. Sensitive detection of somatic point
mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013,
31(3):213–219.
206. Larson DE, Harris CC, Chen K, Koboldt DC, Abbott TE, Dooling DJ, Ley TJ,
Mardis ER, Wilson RK, Ding L. SomaticSniper: Identification of somatic point
mutations in whole genome sequencing data. Bioinformatics 2012, 28(3):311–317.
207. Saunders CT, Wong WS, Swamy S, Becq J, Murray LJ, Cheetham RK. Strelka:
Accurate somatic small-variant calling from sequenced tumor-normal sample
pairs. Bioinformatics 2012, 28(14):1811–1817.
208. Roth A, Ding J, Morin R, Crisan A, Ha G, Giuliany R, Bashashati A et al.
JointSNVMix: A probabilistic model for accurate detection of somatic muta-
tions in normal/tumour paired next-generation sequencing data. Bioinformatics
2012, 28(7):907–913.
224 References
209. Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R.
Dindel: Accurate indel calls from short-read data. Genome Res 2011, 21(6):961–973.
210. Li S, Li R, Li H, Lu J, Li Y, Bolund L, Schierup MH, Wang J. SOAPindel: Efficient
identification of indels from short paired reads. Genome Res 2013, 23(1):195–200.
211. Tang X, Baheti S, Shameer K, Thompson KJ, Wills Q, Niu N, Holcomb IN et al.
The eSNV-detect: A computational system to identify expressed single nucle-
otide variants from transcriptome sequencing data. Nucleic Acids Res 2014,
42(22):e172.
212. Piskol R, Ramaswami G, Li JB. Reliable identification of genomic variants from
RNA-seq data. Am J Hum Genet 2013, 93(4):641–651.
213. Goya R, Sun MG, Morin RD, Leung G, Ha G, Wiegand KC, Senz J et al.
SNVMix: Predicting single nucleotide variants from next-generation sequenc-
ing of tumors. Bioinformatics 2010, 26(6):730–736.
214. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA,
Handsaker RE et al. The variant call format and VCFtools. Bioinformatics 2011,
27(15):2156–2158.
215. vcflib, https://github.com/ekg/vcflib.
216. Freudenberg-Hua Y, Freudenberg J, Kluck N, Cichon S, Propping P, Nothen
MM. Single nucleotide variation analysis in 65 candidate genes for CNS disor-
ders in a representative sample of the European population. Genome Res 2003,
13(10):2271–2276.
217. Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath
SD et al. BreakDancer: An algorithm for high-resolution mapping of genomic
structural variation. Nat Methods 2009, 6(9):677–681.
218. Sindi S, Helman E, Bashir A, Raphael BJ. A geometric approach for classifica-
tion and comparison of structural variants. Bioinformatics 2009, 25(12):i222–i230.
219. Quinlan AR, Clark RA, Sokolova S, Leibowitz ML, Zhang Y, Hurles ME, Mell
JC, Hall IM. Genome-wide mapping and assembly of structural variant break-
points in the mouse genome. Genome Res 2010, 20(5):623–635.
220. Korbel JO, Abyzov A, Mu XJ, Carriero N, Cayting P, Zhang Z, Snyder M, Gerstein
MB. PEMer: A computational framework with simulation-based error models
for inferring genomic structural variants from massive paired-end sequencing
data. Genome Biol 2009, 10(2):R23.
221. Zeitouni B, Boeva V, Janoueix-Lerosey I, Loeillet S, Legoix-ne P, Nicolas A,
Delattre O, Barillot E. SVDetect: A tool to identify genomic structural varia-
tions from paired-end and mate-pair sequencing data. Bioinformatics 2010,
26(15):1895–1896.
222. Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, Rusch MC et al.
CREST maps somatic structural variation in cancer genomes with base-pair
resolution. Nat Methods 2011, 8(8):652–654.
223. Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: A pattern growth
approach to detect break points of large deletions and medium sized insertions
from paired-end short reads. Bioinformatics 2009, 25(21):2865–2871.
224. Emde AK, Schulz MH, Weese D, Sun R, Vingron M, Kalscheuer VM, Haas
SA, Reinert K. Detecting genomic indel variants with exact breakpoints in
single- and paired-end sequencing data using SplazerS. Bioinformatics 2012,
28(5):619–627.
References 225
243. Li B, Leal SM. Methods for detecting associations with rare variants for com-
mon diseases: Application to analysis of sequence data. Am J Hum Genet 2008,
83(3):311–321.
244. Schatz MC, Delcher AL, Salzberg SL. Assembly of large genomes using second-
generation sequencing. Genome Res 2010, 20:1165–1173.
245. Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly
using de Bruijn graphs. Genome Res 2008, 18:821–829.
246. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: A parallel
assembler for short read sequence data. Genome Res 2009, 19:1117–1123.
247. van Heesch S, Kloosterman WP, Lansu N, Ruzius FP, Levandowsky E, Lee CC,
Zhou S et al. Improving mammalian genome scaffolding using large insert
mate-pair next-generation sequencing. BMC Genomics 2013, 14:257.
248. Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q et al. The sequence and de
novo assembly of the giant panda genome. Nature 2010, 463:311–317.
249. Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet 2013, 14:157–167.
250. Desai A, Marwah VS, Yadav A, Jha V, Dhaygude K, Bangar U, Kulkarni V, Jere
A. Identification of optimum sequencing depth especially for de novo genome
assembly of small genomes using next generation sequencing data. PLoS One
2013, 8:e60204.
251. Magoc T, Salzberg SL. FLASH: Fast length adjustment of short reads to improve
genome assemblies. Bioinformatics 2011, 27:2957–2963.
252. Masella AP, Bartram AK, Truszkowski JM, Brown DG, Neufeld JD. PANDAseq:
Paired-end assembler for illumina sequences. BMC Bioinformatics 2012, 13:31.
253. Kelley DR, Schatz MC, Salzberg SL. Quake: Quality-aware detection and cor-
rection of sequencing errors. Genome Biol 2010, 11:R116.
254. Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe
T et al. High-quality draft assemblies of mammalian genomes from massively
parallel sequence data. Proc Natl Acad Sci USA 2011, 108:1513–1518.
255. Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA frag-
ment assembly. Proc Natl Acad Sci USA 2001, 98:9748–9753.
256. Marcais G, Kingsford C. A fast, lock-free approach for efficient parallel count-
ing of occurrences of k-mers. Bioinformatics 2011, 27:764–770.
257. Chikhi R, Medvedev P. Informed and automated k-mer size selection for
genome assembly. Bioinformatics 2014, 30:31–37.
258. Simpson JT. Exploring genome characteristics and sequence quality without a
reference. Bioinformatics 2014, 30:1228–1235.
259. VelvetOptimiser, https://github.com/Victorian-Bioinformatics-Consortium
/VelvetOptimiser.
260. Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones:
A mathematical analysis. Genomics 1988, 2:231–239.
261. Warren RL, Sutton GG, Jones SJ, Holt RA. Assembling millions of short DNA
sequences using SSAKE. Bioinformatics 2007, 23:500–501.
262. Dohm JC, Lottaz C, Borodina T, Himmelbauer H. SHARCGS, a fast and highly
accurate short-read assembly algorithm for de novo genomic sequencing.
Genome Res 2007, 17:1697–1706.
263. Jeck WR, Reinhardt JA, Baltrus DA, Hickenbotham MT, Magrini V, Mardis ER,
Dangl JL, Jones CD. Extending assembly of short DNA sequences to handle
error. Bioinformatics 2007, 23:2942–2944.
References 227
264. Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson
J, Li K, Mobarry C, Sutton G. Aggressive assembly of pyrosequencing reads
with mates. Bioinformatics 2008, 24:2818–2824.
265. Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J. De novo bacterial
genome sequencing: Millions of very short reads assembled on a desktop com-
puter. Genome Res 2008, 18:802–809.
266. Li H. Exploring single-sample SNP and INDEL calling with whole-genome de
novo assembly. Bioinformatics 2012, 28:1838–1844.
267. Diguistini S, Liao NY, Platt D, Robertson G, Seidel M, Chan SK, Docking TR
et al. De novo genome sequence assembly of a filamentous fungus using
Sanger, 454 and Illumina sequence data. Genome Biol 2009, 10:R94.
268. Myers EW. Toward simplifying and accurately formulating fragment assembly.
J Comput Biol 1995, 2:275–290.
269. Gonnella G, Kurtz S. Readjoiner: A fast and memory efficient string graph-
based sequence assembler. BMC Bioinformatics 2012, 13:82.
270. Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES,
Nusbaum C, Jaffe DB. ALLPATHS: De novo assembly of whole-genome shot-
gun microreads. Genome Res 2008, 18:810–820.
271. Maccallum I, Przybylski D, Gnerre S, Burton J, Shlyakhter I, Gnirke A, Malek J
et al. ALLPATHS 2: Small genomes assembled accurately and with high conti-
nuity from short paired reads. Genome Biol 2009, 10:R103.
272. Peng Y, Leung HC, Yiu S-M, Chin FY. IDBA—A practical iterative de Bruijn
graph de novo assembler. In Research in Computational Molecular Biology,
Springer; 2010: 426–440.
273. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G et al. SOAPdenovo2: An empir-
ically improved memory-efficient short-read de novo assembler. Gigascience
2012, 1:18.
274. Ye C, Ma ZS, Cannon CH, Pop M, Yu DW. Exploiting sparseness in de novo
genome assembly. BMC Bioinformatics 2012, 13 Suppl 6:S1.
275. Zimin AV, Marcais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA
genome assembler. Bioinformatics 2013, 29:2669–2677.
276. Pop M, Kosack DS, Salzberg SL. Hierarchical scaffolding with Bambus. Genome
Res 2004, 14:149–159.
277. Koren S, Treangen TJ, Pop M. Bambus 2: Scaffolding metagenomes. Bioinformatics
2011, 27:2964–2971.
278. Gao S, Sung WK, Nagarajan N. Opera: Reconstructing optimal genomic
scaffolds with high-throughput paired-end sequences. J Comput Biol 2011,
18:1681–1691.
279. Dayarian A, Michael TP, Sengupta AM. SOPRA: Scaffolding algorithm for
paired reads via statistical optimization. BMC Bioinformatics 2010, 11:345.
280. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-
assembled contigs using SSPACE. Bioinformatics 2011, 27:578–579.
281. Hunt M, Newbold C, Berriman M, Otto TD. A comprehensive evaluation of
assembly scaffolding tools. Genome Biol 2014, 15:R42.
282. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: Quality assessment tool
for genome assemblies. Bioinformatics 2013, 29:1072–1075.
283. Tsai IJ, Otto TD, Berriman M. Improving draft assemblies by iterative mapping
and assembly of short reads to eliminate gaps. Genome Biol 2010, 11:R41.
228 References
284. Boetzer M, Pirovano W. Toward almost closed genomes with GapFiller. Genome
Biol 2012, 13:R56.
285. Bao E, Jiang T, Girke T. AlignGraph: Algorithm for secondary de novo genome
assembly guided by closely related references. Bioinformatics 2014, 30:i319–i328.
286. Kim J, Larkin DM, Cai Q, Asan, Zhang Y, Ge RL, Auvil L et al. Reference-
assisted chromosome assembly. Proc Natl Acad Sci USA 2013, 110:1785–1790.
287. Chen Y, Negre N, Li Q, Mieczkowska JO, Slattery M, Liu T, Zhang Y et al.
Systematic evaluation of factors influencing ChIP-seq fidelity. Nat Methods 2012,
9:609–614.
288. Daley T, Smith AD. Predicting the molecular complexity of sequencing librar-
ies. Nat Methods 2013, 10:325–327.
289. Diaz A, Nellore A, Song JS. CHANCE: Comprehensive software for quality
control and validation of ChIP-seq data. Genome Biol 2012, 13:R98.
290. Zhao X, Sandelin A. GMD: Measuring the distance between histograms
with applications on high-throughput sequencing reads. Bioinformatics 2012,
28:1164–1165.
291. Diaz A, Park K, Lim DA, Song JS. Normalization, bias correction, and peak call-
ing for ChIP-seq. Stat Appl Genet Mol Biol 2012, 11:Article 9.
292. Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R,
Carriero N, Snyder M, Gerstein MB. PeakSeq enables systematic scoring of
ChIP-seq experiments relative to controls. Nat Biotechnol 2009, 27:66–75.
293. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C
et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol 2008, 9:R137.
294. Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH. An integrated soft-
ware system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol 2008,
26:1293–1300.
295. Shen L, Shao NY, Liu X, Maze I, Feng J, Nestler EJ. diffReps: Detecting differen-
tial chromatin modification sites from ChIP-seq data with biological replicates.
PLoS One 2013, 8:e65598.
296. Manser P, Reimers M. A simple scaling normalization for comparing ChIP-Seq
samples. PeerJ PrePrints 2014, 1.
297. Shao Z, Zhang Y, Yuan GC, Orkin SH, Waxman DJ. MAnorm: A robust model
for quantitative comparison of ChIP-Seq data sets. Genome Biol 2012, 13:R16.
298. Nair NU, Sahu AD, Bucher P, Moret BM. ChIPnorm: A statistical method for
normalizing and identifying differential regions in histone modification ChIP-
seq libraries. PLoS One 2012, 7:e39573.
299. Taslim C, Wu J, Yan P, Singer G, Parvin J, Huang T, Lin S, Huang K. Comparative
study on ChIP-seq data: Normalization and binding pattern characterization.
Bioinformatics 2009, 25:2334–2340.
300. Kharchenko PV, Tolstorukov MY, Park PJ. Design and analysis of ChIP-seq
experiments for DNA-binding proteins. Nat Biotechnol 2008, 26:1351–1359.
301. Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S,
Bernstein BE et al. ChIP-seq guidelines and practices of the ENCODE and
modENCODE consortia. Genome Res 2012, 22:1813–1831.
302. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C,
Singh H, Glass CK. Simple combinations of lineage-determining transcription
factors prime cis-regulatory elements required for macrophage and B cell iden-
tities. Mol Cell 2010, 38:576–589.
References 229
303. Xu H, Handoko L, Wei X, Ye C, Sheng J, Wei CL, Lin F, Sung WK. A signal-noise
model for significance analysis of ChIP-seq with negative control. Bioinformatics
2010, 26:1199–1204.
304. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quan-
tifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5:621–628.
305. Boyle AP, Guinney J, Crawford GE, Furey TS. F-Seq: A feature density estima-
tor for high-throughput sequence tags. Bioinformatics 2008, 24:2537–2538.
306. Tuteja G, White P, Schug J, Kaestner KH. Extracting transcription factor targets
from ChIP-Seq data. Nucleic Acids Res 2009, 37:e113.
307. Feng X, Grossman R, Stein L. PeakRanger: A cloud-enabled peak caller for
ChIP-seq data. BMC Bioinformatics 2011, 12:139.
308. Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers
RM, Sidow A. Genome-wide analysis of transcription factor binding sites based
on ChIP-Seq data. Nat Methods 2008, 5:829–834.
309. Song Q, Smith AD. Identifying dispersed epigenomic domains from ChIP-Seq
data. Bioinformatics 2011, 27:870–871.
310. Zang C, Schones DE, Zeng C, Cui K, Zhao K, Peng W. A clustering approach for
identification of enriched domains from histone modification ChIP-Seq data.
Bioinformatics 2009, 25:1952–1958.
311. Jothi R, Cuddapah S, Barski A, Cui K, Zhao K. Genome-wide identification of
in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res 2008,
36:5221–5231.
312. Nix DA, Courdy SJ, Boucher KM. Empirical methods for controlling false posi-
tives and estimating confidence in ChIP-Seq peaks. BMC Bioinformatics 2008,
9:523.
313. Rashid NU, Giresi PG, Ibrahim JG, Sun W, Lieb JD. ZINBA integrates local
covariates with DNA-seq data to identify broad and narrow regions of enrich-
ment, even within amplified genomic regions. Genome Biol 2011, 12:R67.
314. Shin H, Liu T, Manrai AK, Liu XS. CEAS: cis-Regulatory element annotation
system. Bioinformatics 2009, 25:2605–2606.
315. Zhu LJ, Gazin C, Lawson ND, Pages H, Lin SM, Lapointe DS, Green MR.
ChIPpeakAnno: A Bioconductor package to annotate ChIP-seq and ChIP-chip
data. BMC Bioinformatics 2010, 11:237.
316. Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E et al. Integration of
external signaling pathways with the core transcriptional network in embry-
onic stem cells. Cell 2008, 133:1106–1117.
317. Chen L, Wang C, Qin ZS, Wu H. A novel statistical method for quantitative
comparison of multiple ChIP-seq datasets. Bioinformatics 2015, 31:1889–1896.
318. Xu H, Wei CL, Lin F, Sung WK. An HMM approach to genome-wide identifica-
tion of differential histone modification sites from ChIP-seq data. Bioinformatics
2008, 24:2344–2349.
319. Hon G, Ren B, Wang W. ChromaSig: A probabilistic approach to finding com-
mon chromatin signatures in the human genome. PLoS Comput Biol 2008,
4:e1000201.
320. Liang K, Keles S. Detecting differential binding of transcription factors with
ChIP-seq. Bioinformatics 2012, 28:121–122.
321. Stark R, Brown G. DiffBind: Differential binding analysis of ChIP-Seq peak
data. In R package version 2011, 100.
230 References
322. Taslim C, Huang T, Lin S. DIME: R-package for identifying differential ChIP-
seq based on an ensemble of mixture models. Bioinformatics 2011, 27:1569–1570.
323. Schweikert G, Cseke B, Clouaire T, Bird A, Sanguinetti G. MMDiff: Quantitative
testing for shape changes in ChIP-Seq data sets. BMC Genomics 2013, 14:826.
324. Welch RP, Lee C, Imbriano PM, Patil S, Weymouth TE, Smith RA, Scott LJ, Sartor
MA. ChIP-Enrich: Gene set enrichment testing for ChIP-seq data. Nucleic Acids
Res 2014. doi: 10.1093/nar/gku463.
325. Liu T, Ortiz JA, Taing L, Meyer CA, Lee B, Zhang Y, Shin H et al. Cistrome: An
integrative platform for transcriptional regulation studies. Genome Biol 2011, 12:R83.
326. Machanick P, Bailey TL. MEME-ChIP: Motif analysis of large DNA datasets.
Bioinformatics 2011, 27:1696–1697.
327. Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, van Helden
J. RSAT peak-motifs: Motif analysis in full-size ChIP-seq datasets. Nucleic Acids
Res 2012, 40:e31.
328. Kulakovskiy IV, Boeva VA, Favorov AV, Makeev VJ. Deep and wide digging for
binding motifs in ChIP-Seq data. Bioinformatics 2010, 26:2622–2623.
329. Mahony S, Benos PV. STAMP: A web tool for exploring DNA-binding motif
similarities. Nucleic Acids Res 2007, 35:W253–258.
330. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity
between motifs. Genome Biol 2007, 8:R24.
331. Bailey TL, Machanick P. Inferring direct DNA binding from ChIP-seq. Nucleic
Acids Res 2012, 40:e128.
332. Grant CE, Bailey TL, Noble WS. FIMO: Scanning for occurrences of a given
motif. Bioinformatics 2011, 27:1017–1018.
333. Ernst J, Kellis M. Discovery and characterization of chromatin states for sys-
tematic annotation of the human genome. Nat Biotechnol 2010, 28:817–825.
334. Klein HU, Schafer M, Porse BT, Hasemann MS, Ickstadt K, Dugas M. Integrative
analysis of histone ChIP-seq and transcription data using Bayesian mixture
models. Bioinformatics 2014 (Epub ahead of print).
335. Standards and Guidelines for Whole Genome Shotgun Bisulfite Sequencing,
http://www.roadmapepigenomics.org/protocols.
336. Ziller MJ, Hansen KD, Meissner A, Aryee MJ. Coverage recommendations
for methylation analysis by whole-genome bisulfite sequencing. Nat Methods
2015, 12(3):230–232.
337. Meissner A, Gnirke A, Bell GW, Ramsahoye B, Lander ES, Jaenisch R. Reduced
representation bisulfite sequencing for comparative high-resolution DNA
methylation analysis. Nucleic Acids Res 2005, 33:5868–5877.
338. Nautiyal S, Carlton VE, Lu Y, Ireland JS, Flaucher D, Moorhead M, Gray JW
et al. High-throughput method for analyzing methylation of CpGs in targeted
genomic regions. Proc Natl Acad Sci USA 2010, 107:12587–12592.
339. Varley KE, Mitra RD. Bisulfite Patch PCR enables multiplexed sequencing of
promoter methylation across cancer samples. Genome Res 2010, 20:1279–1287.
340. Deng J, Shoemaker R, Xie B, Gore A, LeProust EM, Antosiewicz-Bourget J, Egli
D et al. Targeted bisulfite sequencing reveals changes in DNA methylation
associated with nuclear reprogramming. Nat Biotechnol 2009, 27:353–360.
341. Ivanov M, Kals M, Kacevska M, Metspalu A, Ingelman-Sundberg M, Milani
L. In-solution hybrid capture of bisulfite-converted DNA for targeted bisulfite
sequencing of 174 ADME genes. Nucleic Acids Res 2013, 41:e72.
References 231
342. Harris RA, Wang T, Coarfa C, Nagarajan RP, Hong C, Downey SL, Johnson BE
et al. Comparison of sequencing-based methods to profile DNA methylation
and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010,
28:1097–1105.
343. Nair SS, Coolen MW, Stirzaker C, Song JZ, Statham AL, Strbenac D, Robinson
MD, Clark SJ. Comparison of methyl-DNA immunoprecipitation (MeDIP) and
methyl-CpG binding domain (MBD) protein capture for genome-wide DNA
methylation analysis reveal CpG sequence coverage bias. Epigenetics 2011,
6:34–44.
344. Booth MJ, Branco MR, Ficz G, Oxley D, Krueger F, Reik W, Balasubramanian S.
Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at
single-base resolution. Science 2012, 336:934–937.
345. Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J,
Turner SW. Direct detection of DNA methylation during single-molecule, real-
time sequencing. Nat Methods 2010, 7:461–465.
346. Laszlo AH, Derrington IM, Brinkerhoff H, Langford KW, Nova IC, Samson
JM, Bartlett JJ, Pavlenok M, Gundlach JH. Detection and mapping of
5-methylcytosine and 5-hydroxymethylcytosine with nanopore MspA. Proc
Natl Acad Sci USA 2013, 110:18904–18909.
347. Schreiber J, Wescoe ZL, Abu-Shumays R, Vivian JT, Baatar B, Karplus K, Akeson
M. Error rates for nanopore discrimination among cytosine, methylcytosine,
and hydroxymethylcytosine along individual DNA strands. Proc Natl Acad Sci
USA 2013, 110:18910–18915.
348. Trim Galore!, http://www.bioinformatics.babraham.ac.uk/projects/trim
_galore/.
349. Hansen KD, Langmead B, Irizarry RA. BSmooth: From whole genome bisulfite
sequencing reads to differentially methylated regions. Genome Biol 2012, 13:R83.
350. Liang F, Tang B, Wang Y, Wang J, Yu C, Chen X, Zhu J, Yan J, Zhao W, Li R.
WBSA: Web service for bisulfite sequencing data analysis. PLoS One 2014,
9:e86707.
351. Xi Y, Li W. BSMAP: Whole genome bisulfite sequence MAPping program. BMC
Bioinformatics 2009, 10:232.
352. Coarfa C, Yu F, Miller CA, Chen Z, Harris RA, Milosavljevic A. Pash 3.0: A ver-
satile software package for read mapping and integrative analysis of genomic
and epigenomic variation using massively parallel DNA sequencing. BMC
Bioinformatics 2010, 11:572.
353. Xi Y, Bock C, Muller F, Sun D, Meissner A, Li W. RRBSMAP: A fast, accurate and
user-friendly alignment tool for reduced representation bisulfite sequencing.
Bioinformatics 2012, 28:430–432.
354. Lim JQ, Tennakoon C, Li G, Wong E, Ruan Y, Wei CL, Sung WK. BatMeth:
Improved mapper for bisulfite sequencing reads on DNA methylation. Genome
Biol 2012, 13:R82.
355. Krueger F, Andrews SR. Bismark: A flexible aligner and methylation caller for
Bisulfite-Seq applications. Bioinformatics 2011, 27:1571–1572.
356. Harris EY, Ponts N, Le Roch KG, Lonardi S. BRAT-BW: Efficient and accurate
mapping of bisulfite-treated reads. Bioinformatics 2012, 28:1795–1796.
357. Chen PY, Cokus SJ, Pellegrini M. BS Seeker: Precise mapping for bisulfite
sequencing. BMC Bioinformatics 2010, 11:203.
232 References
358. Guo W, Fiziev P, Yan W, Cokus S, Sun X, Zhang MQ, Chen PY, Pellegrini M.
BS-Seeker2: A versatile aligning pipeline for bisulfite sequencing data. BMC
Genomics 2013, 14:774.
359. Pedersen B, Hsieh TF, Ibarra C, Fischer RL. MethylCoder: Software pipeline for
bisulfite-treated sequences. Bioinformatics 2011, 27:2435–2436.
360. Kunde-Ramamoorthy G, Coarfa C, Laritsky E, Kessler NJ, Harris RA, Xu M,
Chen R, Shen L, Milosavljevic A, Waterland RA. Comparison and quantitative
verification of mapping algorithms for whole-genome bisulfite sequencing.
Nucleic Acids Res 2014, 42:e43.
361. Lin X, Sun D, Rodriguez B, Zhao Q, Sun H, Zhang Y, Li W. BSeQC: Quality
control of bisulfite sequencing experiments. Bioinformatics 2013, 29:3227–3229.
362. Benoukraf T, Wongphayak S, Hadi LH, Wu M, Soong R. GBSA: A comprehen-
sive software for analysing whole genome bisulfite sequencing data. Nucleic
Acids Res 2013, 41:e55.
363. Akalin A, Kormaksson M, Li S, Garrett-Bakelman FE, Figueroa ME, Melnick A,
Mason CE. methylKit: A comprehensive R package for the analysis of genome-
wide DNA methylation profiles. Genome Biol 2012, 13:R87.
364. Washington University EpiGenome Browser, http://epigenomegateway.wustl
.edu/browser/.
365. Kent WJ, Zweig AS, Barber G, Hinrichs AS, Karolchik D. BigWig and
BigBed: Enabling browsing of large distributed datasets. Bioinformatics 2010,
26:2204–2207.
366. Dorff KC, Chambwe N, Zeno Z, Simi M, Shaknovich R, Campagne F. GobyWeb:
Simplified management and analysis of gene expression and DNA methylation
sequencing data. PLoS One 2013, 8:e69666.
367. Jiang P, Sun K, Lun FM, Guo AM, Wang H, Chan KC, Chiu RW, Lo YM, Sun H.
Methy-Pipe: An integrated bioinformatics pipeline for whole genome bisulfite
sequencing data analysis. PLoS One 2014, 9:e100360.
368. Sun D, Xi Y, Rodriguez B, Park HJ, Tong P, Meong M, Goodell MA, Li W. MOABS:
Model based analysis of bisulfite sequencing data. Genome Biol 2014, 15:R38.
369. Zhang Y, Liu H, Lv J, Xiao X, Zhu J, Liu X, Su J et al. QDMR: A quantitative
method for identification of differentially methylated regions by entropy.
Nucleic Acids Res 2011, 39:e58.
370. Statham AL, Strbenac D, Coolen MW, Stirzaker C, Clark SJ, Robinson MD.
Repitools: An R package for the analysis of enrichment-based epigenomic data.
Bioinformatics 2010, 26:1662–1663.
371. Halachev K, Bast H, Albrecht F, Lengauer T, Bock C. EpiExplorer: Live explora-
tion and global analysis of large epigenomic datasets. Genome Biol 2012, 13:R96.
372. McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM,
Bejerano G. GREAT improves functional interpretation of cis-regulatory
regions. Nat Biotechnol 2010, 28:495–501.
373. Schloss PD, Handelsman J. Metagenomics for studying unculturable micro
organisms: Cutting the Gordian knot. Genome Biol 2005, 6:229.
374. Yarza P, Ludwig W, Euzeby J, Amann R, Schleifer KH, Glockner FO, Rossello-
Mora R. Update of the All-Species Living Tree Project based on 16S and 23S
rRNA sequence analyses. Syst Appl Microbiol 2010, 33:291–299.
375. Delmont TO, Robe P, Clark I, Simonet P, Vogel TM. Metagenomic comparison
of direct and indirect soil DNA extraction approaches. J Microbiol Methods 2011,
86:397–400.
References 233
396. Chan CK, Hsu AL, Halgamuge SK, Tang SL. Binning sequences using very
sparse labels within a metagenome. BMC Bioinformatics 2008, 9:215.
397. Zheng H, Wu H. Short prokaryotic DNA fragment binning using a hierarchi-
cal classifier based on linear discriminant analysis and principal component
analysis. J Bioinform Comput Biol 2010, 8:995–1011.
398. Huson DH, Mitra S, Ruscheweyh HJ, Weber N, Schuster SC. Integrative analy-
sis of environmental sequences using MEGAN4. Genome Res 2011, 21:1552–1560.
399. Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data.
Genome Res 2007, 17:377–386.
400. Gerlach W, Junemann S, Tille F, Goesmann A, Stoye J. WebCARMA: A web
application for the functional and taxonomic classification of unassembled
metagenomic reads. BMC Bioinformatics 2009, 10:430.
401. Gerlach W, Stoye J. Taxonomic classification of metagenomic shotgun sequences
with CARMA3. Nucleic Acids Res 2011, 39:e91.
402. Monzoorul Haque M, Ghosh TS, Komanduri D, Mande SS. SOrt-ITEMS:
Sequence orthology based approach for improved taxonomic estimation of
metagenomic sequences. Bioinformatics 2009, 25:1722–1730.
403. Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M. Accurate and fast estimation of
taxonomic profiles from metagenomic shotgun sequences. BMC Genomics 2011,
12 Suppl 2:S4.
404. Leung HC, Yiu SM, Yang B, Peng Y, Wang Y, Liu Z, Chen J, Qin J, Li R, Chin
FY. A robust and accurate binning algorithm for metagenomic sequences with
arbitrary species abundance ratio. Bioinformatics 2011, 27:1489–1495.
405. Mohammed MH, Ghosh TS, Singh NK, Mande SS. SPHINX—An algorithm for
taxonomic binning of metagenomic sequences. Bioinformatics 2011, 27:22–30.
406. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu
D et al. The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic
through eastern tropical Pacific. PLoS Biol 2007, 5:e77.
407. Davenport CF, Neugebauer J, Beckmann N, Friedrich B, Kameri B, Kokott S,
Paetow M et al. Genometa—A fast and accurate classifier for short metage-
nomic shotgun reads. PLoS One 2012, 7:e41224.
408. Niu B, Zhu Z, Fu L, Wu S, Li W. FR-HIT, a very fast program to recruit metage-
nomic reads to homologous reference genomes. Bioinformatics 2011, 27:1704–1705.
409. Rho M, Tang H, Ye Y. FragGeneScan: Predicting genes in short and error-prone
reads. Nucleic Acids Res 2010, 38:e191.
410. Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metage-
nomic sequences. Nucleic Acids Res 2010, 38:e132.
411. Noguchi H, Taniguchi T, Itoh T. MetaGeneAnnotator: Detecting species-
specific patterns of ribosomal binding site for precise gene prediction in anony-
mous prokaryotic and phage genomes. DNA Res 2008, 15:387–396.
412. Hoff KJ, Lingner T, Meinicke P, Tech M. Orphelia: Predicting genes in metage-
nomic sequencing reads. Nucleic Acids Res 2009, 37:W101–105.
413. Lowe TM, Eddy SR. tRNAscan-SE: A program for improved detection of trans-
fer RNA genes in genomic sequence. Nucleic Acids Res 1997, 25:955–964.
414. Grissa I, Vergnaud G, Pourcel C. CRISPRFinder: A web tool to identify clustered
regularly interspaced short palindromic repeats. Nucleic Acids Res 2007, 35:W52–57.
415. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower
C. Metagenomic microbial community profiling using unique clade-specific
marker genes. Nat Methods 2012, 9:811–814.
References 235
416. Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic infer-
ence. Genome Biol 2008, 9:R151.
417. Kerepesi C, Banky D, Grolmusz V. AmphoraNet: The webserver implementa-
tion of the AMPHORA2 metagenomic workflow suite. Gene 2014, 533:538–540.
418. Wu M, Scott AJ. Phylogenomic analysis of bacterial and archaeal sequences
with AMPHORA2. Bioinformatics 2012, 28:1033–1034.
419. Sharpton TJ, Riesenfeld SJ, Kembel SW, Ladau J, O’Dwyer JP, Green JL, Eisen
JA, Pollard KS. PhylOTU: A high-throughput procedure quantifies microbial
community diversity and resolves novel taxa from metagenomic data. PLoS
Comput Biol 2011, 7:e1001061.
420. Darling AE, Jospin G, Lowe E, Matsen FA, Bik HM, Eisen JA. PhyloSift:
Phylogenetic analysis of genomes and metagenomes. PeerJ 2014, 2:e243.
421. Li W. Analysis and comparison of very large metagenomes with fast clustering
and functional annotation. BMC Bioinformatics 2009, 10:359.
422. Arumugam M, Harrington ED, Foerstner KU, Raes J, Bork P. SmashCommunity:
A metagenomic annotation and analysis tool. Bioinformatics 2010, 26:2977–2978.
423. Treangen TJ, Koren S, Sommer DD, Liu B, Astrovskaya I, Ondov B, Darling AE,
Phillippy AM, Pop M. MetAMOS: A modular and open source metagenomic
assembly and analysis pipeline. Genome Biol 2013, 14:R2.
424. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de
Crecy-Lagard V et al. The subsystems approach to genome annotation and its
use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33:5691–5702.
425. Franzosa EA, Morgan XC, Segata N, Waldron L, Reyes J, Earl AM, Giannoukos
G et al. Relating the metatranscriptome and metagenome of the human gut.
Proc Natl Acad Sci USA 2014, 111:E2329–2338.
426. Abubucker S, Segata N, Goll J, Schubert AM, Izard J, Cantarel BL, Rodriguez-
Mueller B et al. Metabolic reconstruction for metagenomic data and its applica-
tion to the human microbiome. PLoS Comput Biol 2012, 8:e1002358.
427. Ye Y, Doak TG. A parsimony approach to biological pathway reconstruction/
inference for genomes and metagenomes. PLoS Comput Biol 2009, 5:e1000465.
428. Liu B, Pop M. MetaPath: Identifying differentially abundant metabolic path-
ways in metagenomic datasets. BMC Proc 2011, 5 Suppl 2:S9.
429. Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for
microbial marker-gene surveys. Nat Methods 2013, 10:1200–1202.
430. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, Huttenhower
C. Metagenomic biomarker discovery and explanation. Genome Biol 2011,
12:R60.
431. White JR, Nagarajan N, Pop M. Statistical methods for detecting differentially
abundant features in clinical metagenomic samples. PLoS Comput Biol 2009,
5:e1000352.
432. Parks DH, Tyson GW, Hugenholtz P, Beiko RG. STAMP: Statistical analysis of
taxonomic and functional profiles. Bioinformatics 2014, 30:3123–3124.
433. Rodriguez-Brito B, Rohwer F, Edwards RA. An application of statistics to com-
parative metagenomics. BMC Bioinformatics 2006, 7:162.
434. Jain M, Fiddes IT, Miga KH, Olsen HE, Paten B, Akeson M. Improved data
analysis for the MinION nanopore sequencer. Nat Methods 2015, 12:351–356.
435. Ohshiro T, Matsubara K, Tsutsui M, Furuhashi M, Taniguchi M, Kawai T.
Single-molecule electrical random resequencing of DNA and RNA. Sci Rep
2012, 2:501.
236 References
436. Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, Clum A
et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT
sequencing data. Nat Methods 2013, 10:563–569.
437. FALCON, https://github.com/PacificBiosciences/falcon.
438. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis
AA et al. A framework for variation discovery and genotyping using next-
generation DNA sequencing data. Nat Genet 2011, 43:491–498.
439. Reid JG, Carroll A, Veeraraghavan N, Dahdouli M, Sundquist A, English
A, Bainbridge M et al. Launching genomics into the cloud: Deployment of
Mercury, a next generation sequence analysis pipeline. BMC Bioinformatics
2014, 15:30.
440. Zhao S, Prenger K, Smith L, Messina T, Fan H, Jaeger E, Stephens S. Rainbow: A
tool for large-scale whole-genome sequencing data analysis using cloud com-
puting. BMC Genomics 2013, 14:425.
441. Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP. Computational
solutions to large-scale data management and analysis. Nat Rev Genet 2010,
11:647–657.
Statistics
Before detailing the analytic steps for each of these applications, the book
presents the ins and outs of the most widely used NGS platforms, with
side-by-side comparisons of key technical aspects. This helps practitioners
decide which platform to use for a particular project. The book also offers
a perspective on the development of DNA sequencing technologies, from
Sanger to future-generation sequencing technologies.
The book discusses concepts and principles that underlie each analytic
step, along with software tools for implementation. It highlights key features
of the tools while omitting tedious details to provide an easy-to-follow
guide for practitioners in life sciences, bioinformatics, and biostatistics. In
addition, references to detailed descriptions of the tools are given for further
reading if needed. The accompanying website for the book provides step-
by-step, real-world examples of how to apply the tools covered in the text to
research projects. All the tools are freely available to academic users.
K22126
w w w. c rc p r e s s . c o m