ARTICLE
Received 23 Nov 2015 | Accepted 26 May 2016 | Published 30 Jun 2016
DOI: 10.1038/ncomms12065
OPEN
Long-read sequencing and de novo assembly
of a Chinese genome
Lingling Shi1,2,3,*, Yunfei Guo4,*, Chengliang Dong4, John Huddleston5, Hui Yang4, Xiaolu Han6, Aisi Fu7,
Quan Li4, Na Li1, Siyi Gong1, Katherine E. Lintner8, Qiong Ding7, Zou Wang7, Jiang Hu9, Depeng Wang9,
Feng Wang10, Lin Wang11, Gholson J. Lyon12, Yongtao Guan13, Yufeng Shen14, Oleg V. Evgrafov4,15,
James A. Knowles4,15, Francoise Thibaud-Nissen16, Valerie Schneider16, Chack-Yung Yu8, Libing Zhou1,2,3,
Evan E. Eichler5, Kwok-Fai So1,2,3,17,18 & Kai Wang4,15
Short-read sequencing has enabled the de novo assembly of several individual human
genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a
Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct
a physical map by NanoChannel arrays and generate a de novo assembly of 2.93 Gb (contig
N50: 8.3 Mb, scaffold N50: 22.0 Mb, including 39.3 Mb N-bases), together with 206 Mb of
alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the
reference genome GRCh38. Comparison to GRCh38 reveals 12.8 Mb of HX1-specific
sequences, including 4.1 Mb that are not present in previously reported Asian genomes.
Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are
not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that
improved characterization of genome functional variation may require the use of a range of
genomic technologies on diverse human populations.
1 Guangdong-Hongkong-Macau Institute of CNS Regeneration, Jinan University, Guangzhou 510632, China. 2 Ministry of Education Joint International
Research Laboratory of CNS Regeneration, Jinan University, Guangzhou 510632, China. 3 Co-innovation Center of Neuroregeneration, Nantong University,
Nantong 226001, China. 4 Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California 90089, USA. 5 Department of Genome
Sciences, Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA. 6 Genetic, Molecular, and Cellular Biology Program,
Keck School of Medicine, University of Southern California, Los Angeles, California 90089, USA. 7 Wuhan Institute of Biotechnology, Wuhan 430000, China.
8 Department of Pediatrics, The Ohio State University, and The Research Institute at Nationwide Children’s Hospital, Columbus, Ohio 43205, USA.
9 Nextomics Biosciences, Wuhan 430000, China. 10 School of Chemical Engineering and Pharmacy, Wuhan Institute of Technology, Wuhan 430000, China.
11 Center for Tissue Engineering and Regenerative Medicine, Union Hospital, Huazhong University of Science and Technology, Wuhan 430022, China.
12 Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, New York, New York 11797, USA. 13 USDA/ARS Children’s Nutrition Research
Center, Department of Pediatrics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA. 14 Departments of
Systems Biology and Biomedical Informatics, Columbia University, New York, New York 10032, USA. 15 Department of Psychiatry & Behavioral Sciences, Keck
School of Medicine, University of Southern California, Los Angeles, California 90033, USA. 16 National Center for Biotechnology Information, U.S. National
Library of Medicine, Bethesda, Maryland 20894, USA. 17 Department of Ophthalmology, The University of Hong Kong, Hong Kong, China. 18 State Key
Laboratory of Brain and Cognitive Sciences, The University of Hong Kong, Hong Kong, China. * These authors contributed equally to this work.
Correspondence and requests for materials should be addressed to L.Z. (email:
[email protected]) or to K.-F.S. (email:
[email protected]) or to
K.W. (email:
[email protected]).
NATURE COMMUNICATIONS | 7:12065 | DOI: 10.1038/ncomms12065 | www.nature.com/naturecommunications
1
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms12065
T
he advent of next-generation short-read sequencing paved
the way to characterize the genomes of thousands of
species, and had enabled de novo assembly of a few
individual human genomes1,2. However, these assemblies may
have inherent technical limitations in characterizing repeat
elements that span longer than the read length3,4, yet repeats
and segmental duplications are known to cover approximately
half of the human genome. For example, a formal analysis of the
de novo sequence assembly generated from the genome of a Han
Chinese individual and a Yoruban individual showed that
420.2 Mb of common repeats and 99.1% of validated duplicated
sequences were not present, which resulted in missing thousands
of coding exons in the genome assembly3. Therefore, the use of
additional genomic techniques, such as fosmid pooling5 and longread sequencing6, may be necessary to better characterize
complicated genomic regions in human genomes.
Previous studies reported that pervasive genetic differences
exist across different ethnicity groups, especially on structural
variants7–9. For example, through reconstruction of the ancestral
human genome, it was reported that megabases of DNA were lost
in different human lineages and that large duplications were
introgressed from one lineage to another8. In addition, genomic
elements that are absent from reference genomes may be present
in personal genomes10,11. For example, a study estimated that a
complete human pan-genome would contain B19–40 Mb of
novel sequence not present in the extant reference genome12.
These novel sequences that are not present in the reference
genome may harbour functional genomic elements that are
ethnicity-specific, and may affect gene regulations or
transcriptional diversity.
To address the limitations on previously published de novo
genome assembly, and to improve our understanding of
transcriptome variations, here we sequence both DNA and
RNA of a Chinese individual HX1 by single-molecule real-time
(SMRT) long-read sequencing13. We also generate a physical map
of the HX1 genome using IrysChip14, which is a nanopore array
that detects a characteristic seven-nucleotide sequence along very
long genomic segments, typically hundreds of kilobases. We
perform de novo genome assembly to build a Chinese reference
genome, using a hybrid approach that combines long-read
sequencing data and IrysChip data6. We demonstrate a few
unique applications of the HX1 assembly, including the ability to
fill gaps in the human reference genome assembly GRCh38, as
well as the ability to identify fine-scale structural variants. In
parallel, leveraging long-read RNA sequencing, we also identify
novel transcriptional elements, especially those with multiple
spliced isoforms. Through the combined use of a few genomic
techniques, we perform detailed characterization of the HX1
genome and demonstrate that long-read sequencing can detect
functional elements in human genomes that are missed by shortread sequencing.
Results
De novo assembly of a Chinese genome. We leveraged SMRT
DNA sequencing technology13 and sequenced genomic DNA
from an anonymous Chinese individual (HX1) with normal
karyotype (Supplementary Fig. 1) at 103 genome-wide
coverage (Supplementary Table 1). In total, we obtained 44.2
million ‘subreads’ (a portion of the sequencing read that is
informative for downstream analysis) after removing adapters
and performing quality control measures (Supplementary
Methods). These subreads have a mean length of 7.0 kb and a
N50 length of 12.1 kb (Supplementary Table 2 and
Supplementary Fig. 2), where N50 refers to the length for
which the collection of all sequences of that length or longer
contains at least half of the sum of the lengths of all sequences.
We modified and improved the FALCON software15 and
performed de novo genome assembly on the long reads
(Supplementary Methods), resulting in 5,843 contigs
(N50 ¼ 8.3 Mb) and a total size of 2.9 Gb. In addition, 206 Mb
of ‘associated contigs’, that is, alternative haplotypes, were
constructed along with the primary contigs. Finally, we
also performed short-read sequencing on the Illumina HiSeq
X platform, with 143 coverage of the genome (Supplementary
Table 1). Short reads were used to further polish HX1 contigs and
correct indel and single nucleotide variant (SNV) errors. The
continuity of the contigs is substantially higher than assemblies
generated from competing technologies in previous studies
(Table 1), demonstrating the clear advantage of long-read
sequencing in genome assembly. We note that a recently
published genome using the same SMRT technology reported a
contig N50 of 906 kb (ref. 6), and we believe that the almost 10fold improvement in our study can be attributed to the improved
chemistry, longer read length, enhanced assembly algorithm, as
well as the much deeper sequencing depth (Supplementary
Table 3).
To evaluate the completeness and accuracy of the draft HX1
assembly, we performed several analyses. First, we generated a
physical map of the same DNA sample by NanoChannel-based
fluidic IrysChip14. From the IrysChip run with 101 wholegenome coverage (based on all sequence reads 4150 kb;
Supplementary Table 4 and Supplementary Fig. 3), we
calculated the mapping rate of these fragments on different
genome assemblies. We noted that this analysis was biased
against more fragmented assemblies such as HX1, since some
Table 1 | Comparison with previously published de novo assembly on personal genomes.
Sample
Assembly
Accession
Venter
NA12878
HuRef
ALLPATHS
GCA_000002125.2
GCA_000185165.1
BGI-YH
YH_2.0
GCA_000004845.2
BGI-YH
YHref
NA
CHM1
CHM1_1.1
GCF_000306695.2
NA12878
ASM101398v1
GCA_001013985.1
HX1
HX1
NA
Sequencing
platform
Sanger
Illumina GAII
& HiSeq
Illumina
HiSeq
Illumina
HiSeq
Illumina
HiSeq
PacBio SMRT
Scaffolding
platform
BAC
Fosmid
108,431
23,924
Scaffold
N50
17,664,250
12,084,118
2,844,000,504
2,786,258,565
Fosmid
20,516
20,520,932
Fosmid
484,222
BAC
143,936
906 kb/
1.4 Mb (V2)
8,325,004
BioNano
IrysChip
PacBio SMRT BioNano
IrysChip
Contig N50
Size
Size (alternative
haplotype)
NA
NA
No. of
gaps
66,906
220,318
Gap length
34,429,377
171,353,644
2,911,235,363
NA
235,514
105,204,230
23,192,260
2,883,329,361
NA*
NA
NA
50,362,920
3,037,866,619
NA
40,915
210,229,880
26,834,081
3,176,574,379
NA
2,332
146,352,286
21,979,250
2,934,084,193
206,388,248
10,901
39,341,483
NA, not applicable; SMRT, single molecule real time.
*YHref’s alternative haplotype is contained in the haploid-resolved diploid genome (HDG) sequence, which is not readily comparable to our results.
2
NATURE COMMUNICATIONS | 7:12065 | DOI: 10.1038/ncomms12065 | www.nature.com/naturecommunications
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms12065
IrysChip reads may span two different contigs. With a highly
stringent threshold suitable for human genome analysis
(Supplementary Methods), the IrysChip reads yielded
comparable mapping rates to GRCh38 and HX1 at 80.2% and
78.9%, respectively. Second, we aligned HX1 to GRCh38
(Supplementary Methods), and found that 97.11% of the non-N
regions in GRCh38 are covered by HX1 (Supplementary Fig. 4).
This ratio was at a similar level as other personal genome
assemblies including YH2.0 (ref. 10) (94.99%), NA12878 (ref. 6)
(97.50%) and HuRef11 (97.04%). Third, we evaluated consensus
quality of the assemblies, by comparing them with the
chromosomes in the reference genome GRCh38 using
MUMmer16. We found that the consensus accuracy for HX1
was 99.73%, similar to YH2.0 (99.81%), NA12878 (99.73%) and
HuRef (99.84%). Fourth, we also aligned the RefSeq transcripts to
several genome assemblies using the NCBI Assembly Evaluation
pipeline, and found that 391 out of 50,909 transcripts could not
be placed on HX1 versus 455 for NA12878, 306 for YH2.0 and 22
for GRCh38 (primary assembly). The percentage of mapped
coding transcripts with CDS (coding sequence) coverage Z95% is
99.96%, 90.32%, 95.30% and 97.94% for GRCh38, YH2.0,
NA12878 and HX1, respectively (Supplementary Table 5),
demonstrating the low number of mis-assemblies in HX1.
Similarly, the numbers of transcripts split across multiple
genomic locations in GRCh38, YH2.0, NA12878 and HX1 are
11, 1,213, 1,375 and 358, respectively, demonstrating the high
quality of HX1. Last but not least, short-read alignments showed
that 99.42% and 99.33% of the Illumina reads can be mapped to
GRCh38 and HX1, respectively, suggesting the high quality and
completeness of HX1.
To generate scaffolds on the draft assembly, we applied a
hybrid scaffolding approach6 on the IrysChip data and the HX1
draft assembly: first, we performed de novo assembly of the
IrysChip reads, resulting in 2,346 contigs with N50 of 1.80 Mb
(Supplementary Fig. 5). Next, we stitched HX1 contigs together
based on information from the IrysChip assembly. Together with
HX1 contigs that cannot be anchored, the N50 for the hybrid
assembly improved to 22.0 Mb. We next evaluated the misjoining error rate with a similar strategy as described in a previous
study6, by comparing HX1 scaffolds to the reference human
genome GRCh38 with 100 kb window size (Supplementary
Methods). The HX1 scaffolds had a mis-joining error rate of
1.26%, similar to what was observed in YH2.0 (3.88%), NA12878
(0.55%) and HuRef (1.15%). The mis-joining error rates evaluated
at smaller window sizes (10 kb) are 0.83%, 7.36%, 1.25% and
1.11% for HX1, YH2.0, NA12878 and HuRef, respectively. The
final genome assembly contained 2.93 Gb primary sequences
(including 39.3 Mb N-bases) and 206 Mb alternative haplotypes.
Gap filling on the reference genome by de novo assembly. The
reference human genome GRCh38 contains 966 ‘N-gaps’ as
stretches of Ns (see Supplementary Methods for definition), and
we next assessed whether we can fill in these gaps using the
de novo assembly. The average and median lengths of gaps in
GRCh38 are 180 kb and 998 bp, respectively (Fig. 1a). One
previous study by Chaisson et al.17 used SMRT sequencing to fill
gaps; this study used a local assembly approach, and was able to
close or extend into 31 of the 172 interstitial euchromatic gaps in
GRCh38, adding 1.1-Mb sequences to the genome. Another study
that used SMRT sequencing closed 28 interstitial gaps in GRCh38
with 34 kb of assembled sequences6. Given the availability of
whole-genome de novo assembly, we developed a novel statistical
approach called GFA (gap filling by assembly; Supplementary
Methods and Supplementary Fig. 6). Interestingly, we found that
28.4% (274) of the 966 N-gaps in GRCh38 can be completely
or partially filled by HX1, including 148 gaps on primary
chromosomes (Supplementary Data 1). Among the 328 gaps over
10 kb, 36.0% (118) of them can be completely or partially filled.
The total length of filled or shortened gaps amounts to 7.1 Mb
(Fig. 1b). Compared with previous studies, gaps filled by HX1
have overlap with 48 of the 172 interstitial gaps defined by
Chaisson et al.17, adding 1.8 Mb sequences to GRCh38; however,
among the 48 interstitial gaps closed by us, only 10 were closed by
Chaisson et al., suggesting that these two gap closing methods are
highly complementary. We further evaluated the repeat contents
within the gaps that can be closed by us, and found that simple
repeats and satellite sequences were significantly enriched within
the closed gaps compared with GRCh38 (Po0.001; Fig. 1c). As
an example, one B700-bp gap can be completely and confidently
closed (Fig. 1d), where HX1 can be aligned to both flanking
segments of the gap (Fig. 1e). In summary, together with
Chaisson et al., 69 out of 172 interstitial gaps in GRCh38 can be
closed by long-read sequencing.
Characterization of structural variants and novel sequences.
We next catalogued structural variants (SVs) in the genome of
HX1, by comparing with the GRCh38 genome assembly. From
long-read sequencing data, we identified 9,891 deletions and
10,284 insertions by a previously validated local assembly
approach17 (Fig. 2a and Supplementary Methods). We classified
these SVs by type and by repeat contents of the variant sequence
(Table 2), and found that about half of the deletion and insertion
calls are short tandem repeats or mobile element insertions
(Fig. 2b). We further compared SVs in HX1 with those detected
in CHM1—a haploid genome assembled by long-read sequencing
and analysed by the same SV detection method17, as well as all
SVs catalogued in the 1000 Genomes Project9 (Fig. 2c). Owing to
the increased sensitivity of SV detection from long-read
sequencing, HX1 shares substantially more SVs with the CHM1
genome, compared with all SVs catalogued in the 1000 Genomes
Project. In addition, from the IrysChip physical mapping data, we
identified 783 insertions and 377 deletions, using a previously
validated approach18 (Supplementary Methods). From the shortread sequencing data, we also identified 2,403 deletions and 783
insertions (Supplementary Methods). We found that 82.8%
deletions and 66.9% insertions from IrysChip overlap with SVs
detected by long-read sequencing (Supplementary Fig. 7), but
only 33.7% deletions and 13.8% insertions from short-read
sequencing can be detected by long-read sequencing. Finally, we
demonstrated an example where a 204.7-kb deletion can be
visually identified by all technical platforms (Fig. 2d,e). However,
a 132-bp deletion can only be detected by long-read sequencing
(Fig. 2f); it is located within simple repeat regions with high
(57.2%) GC contents (Fig. 2g) and shows no clear drop in
coverage in Illumina data (Fig. 2f), potentially explaining its
failure to be identified in Illumina data using read-depth method.
However, an alternative split-read-based method19 with
appropriate parameters was able to identify this deletion. To
identify likely functional SVs specific to HX1, we filtered out calls
found in segmental duplications or those shared with CHM1, and
intersected the remaining calls with RefSeq exons. This left us
with 35 exonic deletions and 14 exonic insertions. Among these
exonic SVs, 8 deletions (23%) and 5 insertions (31%) had been
previously observed in the 1000 Genomes Project, including a
homozygous exonic deletion in C1orf168 that completely removes
the tenth and eleventh exons. Interestingly, this deletion was only
observed as a heterozygous event in East Asian populations (allele
frequency (AF) ¼ 1.1%), including 10 Han Chinese and 1
Japanese individuals, and was therefore an Asian-specific SV. In
summary, long-read sequencing offers improved sensitivity in
NATURE COMMUNICATIONS | 7:12065 | DOI: 10.1038/ncomms12065 | www.nature.com/naturecommunications
3
ARTICLE
a
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms12065
b
1,000
d
000850F-001-01:2615-4744
160
600
400
120
200
80
2,000
Status
Full
Partial
Count
Count
Count
500
250
0e+00
1,500
chr17:489695-491811
750
40
2e+05
4e+05
6e+05
Length of gaps on GRCh38
1,000
500
c
Percentage of LTR sequences
0.75
0.50
0.25
Percentage of simple repeat sequences
1.00
1.00
Percentage of LINE sequences
0e+00
3e+07
0.75
0.50
0.25
0.00
0.00
HX1
GRCh38
HX1
e
Scale
chr17:
1e+05
2e+05
Length of closed gaps
2,000
3e+05
489,900
490,000
490,100
490,200
0.75
0.50
0.25
0.00
HX1
500
0
0
1.00
0.75
0.50
0.25
0.00
GRCh38
490,300
1,000
1.00
1.00
0.75
0.50
0.25
0.00
GRCh38
HX1
GRCh38
HX1
GRCh38
hg38
1 kb
489,800
1,500
Percentage of satellite sequences
1e+07
2e+07
Length of gaps on GRCh38
Percentage of SINE sequences
0e+00
490,400
490,500
490,600
490,700
490,800
490,900
491,000
491,100
491,200
491,300
491,400
491,500
491,600
491,700
491,800
Assembly from Fragments
AEKP01197830.1
AEKP01197830.1
AEKP01197829.1
Contigs new to GRCh38/(hg38), not carried forward from GRCh37/(hg19)
AEKP01197829.1
Your sequence from blat search
000850F
GENCODE v22 comprehensive transcript set (only basic displayed by default)
RefSeq genes
Repeating elements by repeatmasker
Repeatmasker
Duplications of >1,000 bases of non-repeatmasked sequence
Segmental dups
Figure 1 | Summary of gap filling in GRCh38. (a) Length distribution of all gaps (stretches of ‘N’ in genome sequence) in GRCh38. (b) Length distribution
of all gaps that can be fully or partially closed. (c) Violin plots showing the distribution of LINE, SINE, LTR, simple repeat and satellite in closed gaps and in
GRCh38. (d) A dotplot showing how a gap on 17p13.3 is closed by a contig in HX1. The plot shows comparison of two sequences and each dot indicates a
region of close similarity between them. (e) Genome browser screenshot of the gap region that was closed. The gap is flanked by two contigs that are new
in GRCh38 (not carried forward from GRCh37), yet an HX1 associated contig (000850F-001-01) can completely align to flanking regions, therefore filling
this assembly gap and revising its length from 718 to 731 bp.
identifying SVs, especially those containing repetitive elements,
and some of these SVs may contain functional genomic elements
that are ethnicity-specific.
One of our primary interests lies in the identification of novel
genomic elements that are absent from the reference genome
assembly GRCh38. In total, we identified 12.8 Mb sequences in
HX1 that were not present in GRCh38 primary scaffolds nor its
alternative loci (Supplementary Methods), among which 4.1 Mb
(32%) cannot be mapped to previously published Asian genome
assemblies5,10, suggesting that the majority of HX1-specific
sequences are likely to be found in Asian populations. To
further investigate this, we re-analysed Illumina short-read
sequencing data on a Chinese subject provided by the National
Institute of Standards and Technology (NIST) as standard
benchmarking data. Among the 907 million raw reads, we
achieved a mapping rate of 99.56% to GRCh38; among
unmapped reads, we found that 7.8% can be mapped to the
novel sequence in HX1, confirming that Asian-specific sequence
elements were present but were missed from GRCh38. We next
performed variant calling on the NIST genome. In the regions
shared between HX1 and GRCh38, we identified 3,157,818
4
SNVs on HX1 but 3,852,118 SNVs on GRCh38, suggesting that
Asian-specific reference genome might reduce the number of
called SNVs. However, HX1 might be less appropriate for indel
calling due to higher indel error rate of SMRT long reads. Among
the SNVs called on HX1, 30,713 (B1%) resided within the novel
sequences of HX1. We further examined the contents within the
novel sequences in HX1 (Supplementary Table 6), and found that
microsatellites are significantly enriched in the novel sequences
compared with genome-wide average (75.5% versus 2.1%,
Po10 5). Similarly, simple repeats are also significantly
over-represented (13.0% versus 0.8%, Po10 5). Therefore,
long-read sequencing is especially effective in capturing highly
repetitive regions in the genome.
Several other previously published studies reported novel
sequence elements from Asians, so we performed comparative
analysis. For example, Li et al.12 found 7,330 novel sequences
(4.9 Mb) absent from GRCh37. Re-analysis of their data showed
that 3,440 (3.7 Mb) sequences can be aligned to GRCh38, yet
4,716 (4.4 Mb) can be aligned to HX1. Among these sequences,
3,154 (3.5 Mb) can be aligned to both GRCh38 and HX1.
Furthermore, a recent study identified seven novel genes in an
NATURE COMMUNICATIONS | 7:12065 | DOI: 10.1038/ncomms12065 | www.nature.com/naturecommunications
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms12065
a
1
b
2
3
4
5
Unannotated
27%
6
7
8
9
10
11
12
13
Centromeric
satellites
6%
MEI
21%
14
15
16
18
19
20
21
22
X
Y
MEI
25%
Unannotated
37%
Deletion (>1 kb)
Complex
8%
Other
2%
17
STR
22%
STR
36%
Complex
8%
Insertion (>1 kb)
Centromeric
satellites
4%
Other
4%
Insertion
Deletion
c
4,672
HX1
5,350
7,363
7,682
CHM1
HX1
2,340
19,712
5,633
1000 Genomes
4,010
5,264
2,665
33,823
HX1
CHM1
HX1
Insertion
d
6,978
1000 Genomes
Deletion
f
Long read
Short read
e
GRCh38
Contig 1
Contig 2
g
Scale
chr17:
User track
KRTAP1-1
CTGGCAGCAGCTTGG
AGCAGCTTGGCTG...
AGCAGCGTGGCTG...
AGCTGGTCTCACA...
CTGGCAGCAGCTTGG
70
1 kb
41,041,000
41,041,500
132bp deletion
hg38
41,042,000
GENCODE v22 comprehensive transcript set (only basic displayed by default)
Simple tandem repeats by TRF
GC Percent in 5-Base windows
GC percent
30
All SNPs(142)
Simple nucleotide polymorphisms (dbSNP 142)
Figure 2 | Detection of structural variants by different technologies. (a) Chromosome ideogram showing large-scale (41 kb) deletions (blue) and
insertions (red) identified from long-read sequencing data. (b) Pie chart showing the distribution of different classes of structural variants identified from
long-read sequencing data. (c) Venn diagram showing the overlap of structural variants between HX1, CHM1 and the 1000 Genomes Project for insertions
and deletions, respectively. (d) Integrative Genomics Viewer screenshot of the long-read (upper panel) and short-read alignment (lower panel) around an
B200-kb deletion.(e) Alignment of de novo assembled genome map (blue) to reference genome map (green) where the B200-kb deletion occurs. Black
vertical lines represent labels for the enzyme recognition site. Contig 2 shows identical label patterns as reference, yet contig 1 contains the deletion.
(f) Integrative Genomics Viewer screenshot of long-read (upper panel) and short-read (lower panel) alignment around a 132-bp deletion on KRTAP1-1. This
deletion is visually discernible from long-read sequencing, because the coverage is reduced and half the reads contain the deletion in alignments. However,
read-depth-based method failed to detect this deletion with short read data. (g) Genome browser screenshot of the region surrounding the 132-bp deletion
on KRTAP1-1, demonstrating the presence of simple tandem repeats and the very high GC content of the deletion
Asian genome assembly by comparing with the GRCh37
assembly and examining the transcriptome5. We identified all
seven genes in HX1, but also found that they were present in
GRCh38, indicating the improvements in coverage of GRCh38
over GRCh37 (Supplementary Fig. 8). To assess whether
regulatory functional elements exist in novel sequences in HX1,
we analysed raw sequencing data on five markers (CTCF, DNase I
hypersensitivity, H3K4me1, H3K4me3 and H3K27ac) on the
NATURE COMMUNICATIONS | 7:12065 | DOI: 10.1038/ncomms12065 | www.nature.com/naturecommunications
5
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms12065
Table 2 | Structural variants detected in HX1 in comparison to GRCh38.
Repeat category
MEI
L1
L1HS
L1P
Mosaic Alu
AluS
Alu_STR
AluY
SVA
HERV
LTR
MER
Centromeric Satellites
ALR
HSAT
Complex
STR
Other
Unannotated
Total
Insertion
Number
2,172
175
117
152
187
110
93
865
292
23
92
66
608
484
124
822
3,655
256
2,771
10,284
Mean length
600
1,121
2,622
770
496
219
961
286
511
318
533
351
561
612
362
3,065
332
269
210
586
Deletion
Total bases
1,302,834
196,227
306,826
116,964
92,842
24,144
89,407
247,698
149,172
7,320
49,047
23,187
340,995
296,077
44,918
2,519,672
1,212,704
68,982
580,924
6,026,111
Number
2,499
207
146
214
141
127
63
1,153
227
31
108
82
436
387
49
766
2,130
431
3,629
9,891
Mean length
504
552
2,648
579
360
180
439
284
618
263
256
356
501
551
103
3,075
176
243
180
502
Ins/del
Total bases
1,258,260
114,177
386,599
123,871
50,724
22,819
27,662
327,077
140,327
8,141
27,665
29,198
218,243
213,177
5,066
2,355,090
374,176
104,795
654,225
4,964,789
Total events
0.87
0.85
0.80
0.71
1.33
0.87
1.48
0.75
1.29
0.74
0.85
0.80
1.39
1.25
2.53
1.07
1.72
0.59
0.76
1.04
Total bases
1.04
1.72
0.79
0.94
1.83
1.06
3.23
0.76
1.06
0.90
1.77
0.79
1.56
1.39
8.87
1.07
3.24
0.66
0.89
1.21
Alu_STR, Alu element associated with STR; complex, multiple repeat elements are found; MEI, mobile element insertion; STR, short tandem repeat. SVA, short interspersed nuclear elements/variable
number tandem repeat/Alu; HERV, human endogenous retrovirus; LTR, long tandem repeat; MER, medium reiteration frequency repeats; ALR, a-satellite repeat; HSAT, human satellite sequence.
All events are larger than 50 bp.
lymphoblastoid cell line GM12878 from the ENCODE project
(Supplementary Table 7)20. Among all the reads that cannot be
mapped to GRCh38, 1.1% can be mapped to HX1
(Supplementary Table 8) and we performed peak-calling on
each marker. We also performed RNA-Seq on HX1 using
Illumina short-read sequencing, and found that genes closest to
these peaks within the 500-kb flanking region tend to have higher
expression levels than all other genes in the RNA-Seq data
(Supplementary Fig. 9). In summary, we demonstrated that
long-read sequencing can identify potentially functional pieces in
genomes that evade detection by short-read sequencing.
Characterization of transcriptome variation. To evaluate transcriptional diversity and identify novel transcripts, we performed
long-read RNA sequencing (Iso-Seq) on RNA samples extracted
from whole blood. Iso-Seq uses a Circular Consensus Sequence
protocol, where a given transcript is made into a circular molecule and sequenced multiple times through the circle. Given that
transcripts vary greatly in size, we generated four different
libraries: 1–2 kb; 2–3 kb; 3–5 kb and 5 kbþ , each with 10–16
SMRT cells with a total count of 50 cells (Supplementary Table 9
and Supplementary Figs 10 and 11). For the four libraries, on
average each transcript had 11.2, 8.4, 6.9 and 5.2 passes, respectively. We also used short-read RNA-Seq data to correct errors in
Iso-Seq data (Supplementary Table 10 and Supplementary Figs 12
and 13). From Iso-Seq data, we predicted 58,383 high-quality
consensus isoforms at 30,006 loci. Focusing on consensus
isoforms that are highly expressed, we identified 57 isoforms at 42
loci that do not overlap with any GENCODE transcript
(Supplementary Table 11). We experimentally validated several
spliced transcripts, that is, those with more than two predicted
exons (Supplementary Table 12 and Supplementary Figs 14 and
15). For example, from Iso-Seq alignments (Fig. 3a), we identified
a novel transcribed element with at least five exons and six
isoforms (Fig. 3b), and validated the presence of predicted splicing events by PCR (Fig. 3c) and Sanger sequencing (Fig. 3d).
6
This transcribed element is conserved among primates, but
absent from other species (Fig. 3b), and it has not been detected
by short-read RNA-Seq on all nine cell lines from ENCODE
(Fig. 3b). Similarly, from long-read sequencing data, we identified
another two novel genes on 11q13.4 (five exons) and 14q32.2
(four exons), which evades detection by short-read RNA-Seq
(Supplementary Fig. 16), and validated their presence by Sanger
sequencing of cDNA (Supplementary Figs 17 and 18).
Functional analysis of variants with clinical relevance. One
major utility of personal genome sequencing is to identify diseaserelated genetic variants. We identified 3,518,309 SNVs and
625,690 indels from HX1 by comparing Illumina reads with
GRCh38 (Supplementary Figs 19–21). Among them, 223,883
SNVs and 197,402 indels had minor allele frequency r0.01 in the
1000 Genomes Project21, and 74,143 SNVs and 62,260 indels
were not documented in dbSNP22 version 142. Among these
novel variants, 372 SNVs and 50 indels resided in exonic regions
(Supplementary Figs 22 and 23). Despite the high coverage
(143 ), we noted that the coverage in Illumina data was far
more susceptible to GC biases than PacBio data (Fig. 4a), and that
PacBio reads generally had high coverage in a large proportion of
regions that are poorly covered (r5 ) in Illumina data (Fig. 4b).
We next compared these SNVs from HX1 with those from several
previously published personal genomes, including AK1 (ref. 23),
YH, HuRef and NA12878 (Fig. 4c). As expected, we found that
HX1 shared more variants with the two Asian genomes
(1,462,387), compared with the two Caucasian genomes
(1,166,192). Moreover, the number of unique SNVs in HX1
(560,910) lied between two other Asian genomes, AK1 (623,181)
and YH (421,289), yet smaller than two Caucasian genomes,
HuRef (1,162,179) and NA12878 (920,731). Overall, 742,821
SNVs (21.1%) were shared among HX1 and four other personal
genomes.
To identify genetic variants that may be of clinical significance,
we annotated HX1 variants against the ClinVar database24. A
NATURE COMMUNICATIONS | 7:12065 | DOI: 10.1038/ncomms12065 | www.nature.com/naturecommunications
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms12065
a
chr20 (q13.12)
20p13
20p12.3
20p12.2
20p12.1
20p11.23
20p11.21 20p11. 1 q11. 1 20q11.21 q11.22 20q11.23
20q12
20q13.12
20q13.13
20q13. 2
13.32
20q13.33
Coverage
Alignment
b
Scale
chr20:
hg38
10 kb
45,178,000
45,179,000 45,180,000
45,181,000 45,182,000
45,183,000
45,184,000 45,185,000
45,186,000 45,187,000 45,188,000 45,189,000
User supplied track
45,190,000
45,191,000 45,192,000
45,193,000 45,194,000
45,195,000 45,196,000
45,197,000
TCONS_00035151
TCONS_00035154
TCONS_00035156
TCONS_00035157
TCONS_00035158
TCONS_00035159
TCONS_00035160
TCONS_00035162
GENCODE v22 comprehensive transcript set (only basic displayed by default)
RefSeq genes
Transcription levels assayed by RNA-seq on 9 cell lines from ENCODE
Transcription
DNase I hypersensitivity peak plusters from ENCODE (95 cell types )
DNase clusters
7 vertebrates conserved elements
Multiz alignments of 7 vertebrates
Chimp
Rhesus
Mouse
Rat
Dog
Opossum
96
TGATCCCAAAAC
TCACACCTGGAA
Exon 1–2
Exon 1–3
TCACACCCAAAA TTTTACACTCCAT
A2
23
d
M
1
C
M
hx
M
Size
ar
ke
r
9
c
1,500
1,000
500
Exon 1–2–3–4–5
Exon 1–3–4–5
Exon 1–4–5
Exon 1–3–5
TTCCAGGTGAAC
TTCCAGGTGTGA
Exon 3–4
Exon 3–5
Exon 1–5
Exon 2–3
CATTGGGTGTGAC
Exon 1–5
100
Exon 4–5
Figure 3 | Novel gene inferred from Iso-Seq long-read RNA sequencing. (a) Integrative Genomics Viewer on alignment files generated from Iso-Seq.
Over 100 long reads can be mapped to this locus on chr20q13.12 in the GRCh38 assembly. (b) UCSC Genome Browser screenshot on the predicted
transcript models. The transcripts are not detected in RNA-Seq data on nine cell lines in ENCODE. This gene is conserved in primates but not in other
vertebrate species, and is not in segmental duplication regions or simple repeat regions. (c) PCR validation of the transcript TCONS_0035154 by a primer
pair that targeted exons 1 and 5. Several PCR products with different sizes can be detected, representing different isoforms. MC239 is a Caucasian sample
and MA296 is an East Asian sample. (d) Sanger sequencing confirmed the splicing events predicted by the Iso-Seq data.
total of 2,432 variants (2,357 SNVs and 75 indels) in HX1 were
documented in ClinVar24, including 20 variants that were
classified as ‘pathogenic’. However, a simple allele frequency
filter with manual examination showed that 18 of these 20
‘pathogenic’ variants had minor allele frequency 41% in the 1000
Genomes Project, and were unlikely to be highly penetrant
disease causal variants (Fig. 4d). The remaining two variants
included one upstream variant at the MSMB locus that was
annotated as pathogenic for hereditary prostate cancer, as well as
one stop-gain variant within DUOXA2 that was annotated as
pathogenic for thyroid dyshormonogenesis. Manual review of the
literature cited in the two ClinVar records25,26 indicated that both
of them represented erroneous database records. Therefore, no
known pathogenic variants truly exist in HX1. This analysis
highlights the need for extreme caution in interpreting
‘pathogenic’ variants documented in variant databases, and
suggests that frequency filter as well as manual review are
necessary to tease out false positives. With the continuous
improvements of ClinVar, the addition of evidence codes for
clinical interpretation, and the expansion of public allele
frequency databases, this problem is expected to be alleviated in
the future.
Discussion
In the current study, we generated one of the first near-complete
de novo genome assemblies for a Chinese individual. The contig
and scaffold N50 values of the assembly were substantially higher
than previous studies on de novo human genome assembly,
implicating the unique advantage of long-read sequencing in
assembling complex genomes such as the human genome. In
addition, our study also identified a number of previously
NATURE COMMUNICATIONS | 7:12065 | DOI: 10.1038/ncomms12065 | www.nature.com/naturecommunications
7
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms12065
a
c
Illumina
AK1
Coverage
200
623,181
150
• • • • • • • •
• • • • •
• •
• •
• •
• •
• •
• •
• •
•
• •
•
• •
100
174,647
157,323
39,415
HuRef
226,930
39,622
50
0.4
GC
0.6
560,910
1,162,179
176,150
38,688
Pacbio
742,821
54,173
150
Coverage
238,277
92,550
0.2
HX1
152,124
400,086
193,090
164,963
145,135
113,273
100
• • • • • • • • • • • • • • •
• • • • •
• • • • • • • •
• • • • •
44,492
194,743
45,451
58,675
50
70,819
118,776
0
0.2
0.4
GC
63,342
98,875
131,234
421,289
920,731
0.6
YH
NA12878
b
d
Pacbio reads coverage on illumina
low-coverage regions
Input: 3.5 million SNVs, 0.6 million indels
Identify variants in ClinVar
8,000
2353 SNVs, 75 indels
Frequency
6,000
Identify pathogenic
variants in ClinVar
19 SNVs, 1 indel
4,000
Remove common variants
2,000
2 SNVs
Manual review of literature
0
0
100
200
300
400
Coverage of pacbio reads
500
None
Figure 4 | Functional annotation and analysis of the genomic variants in HX1. (a) Average coverage versus GC contents for 100-bp windows in Illumina
data and PacBio data, respectively. The mean and s.d. values are shown. (b) Distribution of PacBio coverage for regions that have r5 coverage in
Illumina data. (c) Shared SNVs discovered in HX1, AK1, HuRef, NA12878 and YH. (d) Variant reduction pipeline to identify pathogenic variant; although 20
were annotated as ‘pathogenic’ in ClinVar, careful analysis failed to support any one.
unreported functional genomic elements, some of which can be
transcribed. Therefore, long-read RNA sequencing may complement conventional short-read RNA sequencing to capture the
complete landscape of the human transcriptome.
There are several current limitations to use long-read
sequencing to generate de novo assemblies and analyse personal
genomes. First, although the read length is substantially higher
than short-read sequencing, it is still at the scale of tens of
kilobases due to the technological limitations of the current
sequencing platform. Some highly complex genomic regions may
still not be adequately assayed or assembled, especially when
sequencing coverage is low. With the current development of a
number of nanopore-based long-read sequencing platforms, this
problem may be alleviated by technological innovations. Second,
some gaps in the reference genome are long and are surrounded
by segmental duplications or other highly repetitive sequences27;
therefore, they may not be filled by our long read assembly. For
example, 24 of the 966 gaps are longer than 300 kb based on
current estimation, and none of them can be closed by our
method. Third, compared with the Illumina platform that
enabled $1000 genomes, PacBio long-read sequencing is still
relatively cost-prohibitive to be applied to personal genomes at
large scale. With the continued decline in sequencing cost and the
improvement in sequencing throughput per flow cell, this
problem may be reduced in the future. Fourth, due to the
much higher error rates (especially for indels) of PacBio longread sequencing, variant detection will not be reliable at low
sequencing coverage, so analysis of genetic mutations in personal
genomes still needs to rely on more accurate short-read
8
sequencing data. Finally, in the current study, to demonstrate
the use of reference-free de novo assembly, we used the
NanoChannel arrays for scaffolding the contigs from the
genome assembly, which resulted in B3-fold improvements in
N50 values. However, we acknowledge that we could alternatively
use the reference human genome GRCh38 for generating much
longer scaffolds.
In summary, while short-read-based alignment and variant
calling based on reference genome remain a common practice to
assay personal genomes, de novo assembly by long-read
sequencing may reveal novel and complementary biological
insights. Furthermore, long-read RNA sequencing may identify
novel transcripts that can be missed by short-read RNA
sequencing. Improved understanding and better characterization
of genome functional variation may require the use of a range of
genomic technologies on diverse human populations28.
Methods
Generation of sequence data. Freshly drawn blood samples were obtained from
an anonymous healthy Chinese adult male (HX1) with normal karyotype, using
protocols approved by the Institutional Review Board of Jinan University. HX1
provided written consent for public release of genomic data. For long-read DNA
sequencing, high-molecular-weight DNA was extracted from isolated lymphocytes
using Phenol–Chloroform method and sequenced by the PacBio sequencer RS II,
with the P6-C4 sequencing reagent. For long-read RNA sequencing, total RNA was
extracted from blood using TRIzol extraction reagent and subjected to the IsoSeq
protocol with four library sizes (1–2 kb, 2–3 kb, 3–5 kb and 5 kbþ ), and sequenced
on the PacBio sequencer RS II. For short-read DNA sequencing, genomic DNA
was subject to Illumina TruSeq Nano library preparation protocol, and 151-bp
paired-end reads were generated by Illumina HiSeq X sequencer. For short-read
RNA sequencing, RNA samples were subject to the TruSeq mRNA Library Kit and
NATURE COMMUNICATIONS | 7:12065 | DOI: 10.1038/ncomms12065 | www.nature.com/naturecommunications
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms12065
sequenced on Illumina HiSeq2500 sequencer. For BioNano physical mapping,
DNA extracted from freshly drawn whole blood were subject to manufacturerrecommended protocols for library preparation and optical scanning, with the
default nicking enzyme NT.BspQI.
De novo genome assembly. Long-read de novo genome assembly was performed
with an enhanced version of FALCON software, which improved its performance
in an NFS-based computing cluster. The source code is available on GitHub
(https://github.com/WGLab/EnhancedFALCON). Since we sequenced a diploid
human genome, alternative haplotypes may exist in certain regions with high
variability or large structural variants. As a result, associated contigs are
constructed by FALCON. BioNano de novo genome assembly was performed by
the IrysView software on a computational cluster, with manufacturer-recommended parameters and with molecular length threshold of 150 kb. Quality
assessment of BioNano data used a stringent parameter of ‘-T 1e-9’ suitable for
human genome and a 10% subsampling strategy. Hybrid scaffolding of the
PacBio-based assembly and BioNano-based assembly was performed by the hybrid
scaffolding module packaged with the IrysView software, with details given in
Supplementary Methods. Short-read polishing was performed with BWA-MEM29,
FreeBayes30 and custom python scripts. Consensus quality evaluation was done
by MUMmer16. Mis-joining rate was calculated by custom python scripts on
BWA-MEM alignments.
Gap filling. We developed a GFA procedure for closing gaps in the reference
genome. Any region consisting of continuous runs of N in the target assembly
(GRCh38) is defined as a gap in our method. After merging gaps that are r500 bp
apart, flanking sequences upstream and downstream of the gaps were mapped to
the source assembly (HX1). If two anchor sequences for the same gap can both be
aligned, they will be examined to remove discordant pairs, which include those
alignments with inconsistent orientation, on different contigs, or overlapping with
each other. If only one anchor can be aligned, then the anchor will be extended into
the gap region wherever possible. The source code for GFA has been deposited to
github (https://github.com/WGLab/uniline). Detailed statistical formulation was
given in Supplementary Methods. Briefly, for a gap with length L0 and a predicted
gap with length Lg, we showed that the probability of observing a gap with
difference d ¼ Lg L0 or less extreme is
P
L g L0 d L g L0
Lg 6¼ L0
Pg ¼
Pð 0:5 d 0:5Þ
L g ¼ L0
Gap closing quality score is then calculated by summing up Phred-scaled Pg and
mapping quality score (Pa) assuming independence. The model permits B10% of
flexibility at a threshold of 30 and does not penalize harshly when Lg deviates two
to three times from L0 .
Transcriptome analysis and validation. We first performed isoform-level
clustering using the RS_IsoSeq protocol within the SMRTPortal software. This
protocol essentially performs isoform-level clustering (ICE) and polishes the results
with Quiver. The output from ICE algorithm contains consensus sequences from
full-length reads. The Quiver polished output is classified into either ‘low QV’ or
‘high QV’. Our analysis focused on the high-QC consensus isoform clusters, where
‘Quiver high QV’ is currently set with an expected consensus accuracy of 99%.
Once we obtained the high-quality consensus clusters, we further aligned them to
the GRCh38 reference genome using the GMAP31 algorithm. To improve Iso-Seq
read alignment, we further performed error correction of all original Iso-Seq reads
using LSC, following similar steps in its original publication32. LSC is an algorithm
designed for improving PacBio long-read accuracy by short-read alignment from
Illumina RNA-Seq. Alignment and analysis of short-read RNA sequencing data
was performed by the TopHat33 software and Cufflinks34 software, respectively.
The fragments per kilobase of transcript per million mapped reads (FPKM)
measure was used for quantification of gene expression in the short-read
sequencing data. Comparison of transcript models was performed by the
CuffCompare software within the Cufflinks package. We validated several novel
transcripts with more than two predicted exons, by designing pairs of PCR primers
that are located in two adjacent exons, and performed PCR reactions on the cDNA
samples. The gel bands were cut and DNA was recovered by Qiagen QIAquick kit
(Valencia, CA, USA), and used for Sanger sequencing.
Detection and analysis of genome variation. Detection of structural variation on
short-read sequencing data, long-read sequencing data and BioNano mapping data
were performed by the CNVNator35, local assembly17 and IrysView software,
respectively. SNP and indel calling were performed by the SeqMule36 software,
which integrates BWA-MEM29 for alignment and GATK37/FreeBayes30/
SAMtools38 for variant calling. Comparative analysis of genome assembly was
performed by the MUMmer16 and the LASTZ39, together with custom scripts.
Annotation and functional analysis of genetic variants were performed with the
ANNOVAR40 software, using a variety of built-in databases to infer gene-level
functional annotations41, the allele frequencies in the 1000 Genomes Project21, the
presence/absence in dbSNP22 (version 142) and ClinVar24 (version 20160302), and
the allele frequency in several public databases. Statistical analysis and graph
generation were performed using R (http://www.R-project.org) and custom
Perl/Python scripts.
Data availability. Data generated during the study are available from the authors.
All raw short-read and long-read sequencing data have been deposited in the NCBI
short read archive under study PRJNA301527. The genome assembly and related
results can be accessed at http://hx1.wglab.org/.
References
1. Li, R. et al. De novo assembly of human genomes with massively parallel short
read sequencing. Genome Res. 20, 265–272 (2010).
2. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from
massively parallel sequence data. Proc. Natl Acad. Sci. USA 108, 1513–1518
(2011).
3. Alkan, C., Sajjadian, S. & Eichler, E. E. Limitations of next-generation genome
sequence assembly. Nat. Methods 8, 61–65 (2011).
4. Chaisson, M. J., Wilson, R. K. & Eichler, E. E. Genetic variation and the de novo
assembly of human genomes. Nat. Rev. Genet. 16, 627–640 (2015).
5. Cao, H. et al. De novo assembly of a haplotype-resolved human genome.
Nat. Biotechnol. 33, 617–622 (2015).
6. Pendleton, M. et al. Assembly and diploid architecture of an individual human
genome via single-molecule technologies. Nat. Methods 12, 780–786 (2015).
7. Jakobsson, M. et al. Genotype, haplotype and copy-number variation in
worldwide human populations. Nature 451, 998–1003 (2008).
8. Sudmant, P. H. et al. Global diversity, population stratification, and selection of
human copy number variation. Science 349, aab3761 (2015).
9. Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human
genomes. Nature 526, 75–81 (2015).
10. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature
456, 60–65 (2008).
11. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol.
5, e254 (2007).
12. Li, R. et al. Building the sequence map of the human pan-genome. Nat.
Biotechnol. 28, 57–63 (2010).
13. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules.
Science 323, 133–138 (2009).
14. Cao, H., Tegenfeldt, J. O., Austin, R. H. & Chou, S. Y. Gradient nanostructures
for interfacing microfluidics and nanofluidics. Appl. Phys. Lett. 81, 3058–3060
(2002).
15. Chin, C. S. et al. Nonhybrid, finished microbial genome assemblies from longread SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
16. Kurtz, S. et al. Versatile and open software for comparing large genomes.
Genome Biol. 5, R12 (2004).
17. Chaisson, M. J. et al. Resolving the complexity of the human genome using
single-molecule sequencing. Nature 517, 608–611 (2015).
18. Lam, E. T. et al. Genome mapping on nanochannel arrays for structural
variation analysis and sequence assembly. Nat. Biotechnol. 30, 771–776
ð2012Þ:
19. Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic
framework for structural variant discovery. Genome Biol. 15, R84 (2014).
20. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in
the human genome. Nature 489, 57–74 (2012).
21. Auton, A. et al. A global reference for human genetic variation. Nature 526,
68–74 (2015).
22. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids
Res. 29, 308–311 (2001).
23. Kim, J. I. et al. A highly annotated whole-genome sequence of a Korean
individual. Nature 460, 1011–1015 (2009).
24. Landrum, M. J. et al. ClinVar: public archive of relationships among
sequence variation and human phenotype. Nucleic Acids Res. 42, D980–D985
(2014).
25. Thomas, G. et al. Multiple loci identified in a genome-wide association study of
prostate cancer. Nat. Genet. 40, 310–315 (2008).
26. Zamproni, I. et al. Biallelic inactivation of the dual oxidase maturation factor 2
(DUOXA2) gene as a novel cause of congenital hypothyroidism. J. Clin.
Endocrinol. Metab. 93, 605–610 (2008).
27. International Human Genome Sequencing Consortium. Finishing the
euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
28. Whole genome? Nat Genet 47, 963 (2015).
29. Li, H. & Durbin, R. Fast and accurate short read alignment with BurrowsWheeler transform. Bioinformatics 25, 1754–1760 (2009).
30. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read
sequencing. Preprint at http://arXiv.org/abs/1207.3907 (2012).
31. Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and
alignment program for mRNA and EST sequences. Bioinformatics 21,
1859–1875 (2005).
32. Au, K. F., Underwood, J. G., Lee, L. & Wong, W. H. Improving PacBio long
read accuracy by short read alignment. PLoS ONE 7, e46679 (2012).
NATURE COMMUNICATIONS | 7:12065 | DOI: 10.1038/ncomms12065 | www.nature.com/naturecommunications
9
ARTICLE
NATURE COMMUNICATIONS | DOI: 10.1038/ncomms12065
33. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of
insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).
34. Trapnell, C. et al. Differential gene and transcript expression analysis of RNAseq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).
35. Abyzov, A., Urban, A. E., Snyder, M. & Gerstein, M. CNVnator: an approach to
discover, genotype, and characterize typical and atypical CNVs from family and
population genome sequencing. Genome Res. 21, 974–984 (2011).
36. Guo, Y., Ding, X., Shen, Y., Lyon, G. J. & Wang, K. SeqMule: automated
pipeline for analysis of human exome/genome sequencing data. Sci. Rep. 5,
14283 (2015).
37. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for
analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303
(2010).
38. Li, H. et al. The Sequence Alignment/Map format and SAMtools.
Bioinformatics 25, 2078–2079 (2009).
39. Harris, R. S. Improved Pairwise Alignment of Genomic DNA (PhD thesis, The
Pennsylvania State Univ., 2007).
40. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of
genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38,
e164 (2010).
41. Dong, C. et al. Comparison and integration of deleteriousness prediction
methods for nonsynonymous SNVs in whole exome sequencing studies. Hum.
Mol. Genet. 24, 2125–2137 (2015).
Author contributions
Acknowledgements
This work is licensed under a Creative Commons Attribution 4.0
International License. The images or other third party material in this
article are included in the article’s Creative Commons license, unless indicated otherwise
in the credit line; if the material is not included under the Creative Commons license,
users will need to obtain permission from the license holder to reproduce the material.
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
We thank the staff members of the Wuhan Institute of Biotechnology for generating the
long-read DNA sequencing data, and thank Pacific Biosciences for generating the longread RNA sequencing data. We thank Ahmed Naguib from BioNano Genomics for
providing technical assistance. This project is supported by NIH grant HG006465 and
MH108728 (K.W.), HG007635 and HG002385 (E.E.E.), the National Natural Science
Foundation of China 31400922 (L.S.), the Program of Introducing Talents of Discipline
to Universities B14036 (L.S. and S.K.F.) and Leading Talents of Guangdong 2013 (S.K.F.).
E.E.E. is an investigator of the Howard Hughes Medical Institute.
10
L.S., K.-F.S. and K.W. designed the study and guided its execution; L.S., Y.G., C.D., J.H.,
H.Y. and J. Hu performed data analysis; L.S., X.H., Y.G., A.F., N.L., S.G., K.E.L., Q.D.,
Z.W., D.W., F.W., L.W., O.V.E., J.A.K., F.T.-N., V.S., C.-Y.Y., K.L. and L.Z. prepared the
samples and generated the data; G.J.L., Y.G., Y.S. and E.E.E. advised on data analysis. All
authors read and approved the final manuscript.
Additional information
Supplementary Information accompanies this paper at http://www.nature.com/
naturecommunications
Competing financial interests: J.Hu and D.W. are employees of Nextomics Biosciences.
G.J.L. serves on the advisory boards of Omicia, Inc., GenePeeks, Inc. and Good Start
Genetics, Inc. K.W. is a board member and shareholder of Tute Genomics, Inc. and
Nextomics Biosciences. E.E.E. is on the scientific advisory board of DNAnexus, Inc.
and is a consultant for Kunming University of Science and Technology (KUST) as
part of the 1000 China Talent Program. The remaining authors declare no competing
financial interests.
Reprints and permission information is available online at http://npg.nature.com/
reprintsandpermissions/
How to cite this article: Shi, L. et al. Long-read sequencing and de novo assembly of a
Chinese genome. Nat. Commun. 7:12065 doi: 10.1038/ncomms12065 (2016).
r The Author(s) 2016
NATURE COMMUNICATIONS | 7:12065 | DOI: 10.1038/ncomms12065 | www.nature.com/naturecommunications