Mgen 000087


Research Paper

Enrichment by hybridisation of long DNA fragments for Nanopore sequencing
Sabine E. Eckert,1 Jackie Z.-M. Chan,1 Darren Houniet,1 the PATHSEEK consortium,2 Judy Breuer3 and Graham Speight1
1 Oxford Gene Technology, Begbroke Science Park, Begbroke Hill, Woodstock Road, Begbroke, Oxfordshire OX5 1PF, UK
2 A list of participants can be found in the Acknowledgements
3 UCL Division of Infection & Immunity, Cruciform Building, Gower Street, University College London, London WC1E 6BT, UK

Correspondence: Graham Speight ([email protected])


DOI: 10.1099/mgen.0.000087

Enrichment of DNA by hybridisation is an important tool which enables users to gather target-focused next-generation sequence data in an economical fashion. Current in-solution methods capture short fragments of around 200–300 nt, potentially missing key structural information such as recombination or translocations often found in viral or bacterial pathogens. The increasing use of long-read third-generation sequencers requires methods and protocols to be adapted for their specific requirements. Here, we present a variation of the traditional bait–capture approach which can selectively enrich large fragments of DNA or cDNA from specific bacterial and viral pathogens, for sequencing on long-read sequencers. We enriched cDNA from cultured influenza virus A, human cytomegalovirus (HCMV) and genomic DNA from two strains of Mycobacterium tuberculosis (M. tb) from a background of cell line or spiked human DNA. We sequenced the enriched samples on the Oxford Nanopore MinION and the Illumina MiSeq platform and present an evaluation of the method, together with analysis of the sequence data. We found that unenriched influenza A and HCMV samples had no reads matching the target organism due to the high background of DNA from the cell line used to culture the pathogen. In contrast, enriched samples sequenced on the MinION platform had 57 % and 99 % best-quality on-target reads respectively.

Keywords: enrichment by hybridisation; human Herpes virus/cytomegalovirus; influenza A; Mycobacterium tuberculosis; nanopore sequencing.
Abbreviations: HCMV, human cytomegalovirus; IGV, Integrated Genome Viewer; M. tb, Mycobacterium tuberculosis;
ONT, Oxford Nanopore Technologies.
Data statement: All supporting data, code and protocols have been provided within the article or through supplementary
data files.

Received 3 May 2016; Accepted 26 August 2016
© 2016 The Authors. Published by Microbiology Society
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/).

Data Summary

The raw datasets from Nanopore and Illumina reads generated in this study were deposited in the European Nucleotide Archive: Study PRJEB12651; http://www.ebi.ac.uk/ena/data/view/PRJEB12651.

Impact Statement

Our work describes a method for the selective enrichment of known viral or bacterial pathogen DNA from a background of host DNA for sequencing on the Oxford Nanopore MinION long-read sequencer. We developed a protocol for enriching large DNA fragments (>1 kb) by in-solution hybridisation, as contrasted to short fragments (200–300 bp) used for second-generation sequencing. In this proof-of-principle experiment, we enriched long DNA fragments of influenza virus A, human cytomegalovirus and Mycobacterium tuberculosis from their culture cell line or from a laboratory-made mixture of bacterial and human genomic DNA. We believe our method and evaluation of the results will be of interest to the growing group of users of long-read sequencers (Oxford Nanopore, Pacific Biosciences). For example, this method could be used in the pathogen field for the whole-genome sequencing of small target organisms in mixed/clinical samples and in the identification of structural variants such as translocations in small or large genomes.

Introduction

While the cost of next-generation sequencing has been falling continuously in recent years, the enrichment of specific DNA regions or whole genomes from microorganisms allows the multiplexing of several samples per run whilst maintaining a high depth of coverage over the regions of interest. The capture of viral and bacterial organisms from mixed samples by in-solution bait hybridisation, followed by high-throughput sequencing, is advantageous for the evaluation of variant frequency and deconvolution of PCR duplicates, compared with the sequencing of PCR-generated amplicons (Samorodnitsky et al., 2015). This enrichment method can be used in a clinical setting to aid and refine timely diagnosis (Wlodarska et al., 2015), for example from extensively or totally drug-resistant pathogens in a time of antibiotic overuse (Carlet, 2015). Data from whole-genome sequencing provides a wealth of information such as identification of resistance markers carried by the infecting agent(s), allowing for rapid, targeted and personalised treatment. Previous studies have shown that it is possible to bypass the traditional culture-based diagnosis and obtain information by sequencing metagenomic samples, but the throughput is low and the method prohibitively costly for routine use (e.g. Doughty et al., 2014, Loman et al., 2013). A potentially disruptive diagnostic platform to sequence enriched bacterial and viral pathogens directly from clinical samples has been previously described by Brown et al. (2015) and Christiansen et al. (2014). This approach employs custom baits to capture genomic material from the target organisms, thereby reducing the amount of human and commensal DNA in the clinical samples and allowing greater throughput of samples. However, this method is optimised for short-read sequencers such as the Illumina MiSeq and the Ion PGM, and is unsuitable for long-read sequencers. Information from long-read platforms could be used, for example, to resolve highly repetitive regions such as those found in cytomegalovirus (Masse et al., 1992), detect large structural variations (Jiang et al., 2015) or provide evidence of recombination events such as those seen in Chlamydia trachomatis (Joseph & Read, 2012). Enrichment of specific genomic fragments by PCR-generated baits for sequencing on an Oxford Nanopore MinION sequencer was demonstrated by Karamitros & Magiorkinis (2015).

Here, we present an adaptation of the method used by Brown et al. (2015) and Christiansen et al. (2014), enrichment of DNA fragments of between 1 and 15 kb for sequencing on long-read platforms. We joined the Oxford Nanopore Technologies (ONT) MinION Access Program to assess the suitability of this platform used in combination with the targeted enrichment method. We compared sequence data from unenriched and enriched cultured influenza virus A and HCMV samples, run on the MinION and Illumina MiSeq platforms. We also mixed cultured Mycobacterium tuberculosis (M. tb) genomic DNA from two different strains with human DNA to evaluate the efficiency of enrichment by hybridisation for longer bacterial DNA fragments. We found that the long genomic fragments were readily purified from a background of the cell line used for producing the viruses, or, in case of M. tb, mixed with human DNA.

Methods

Samples. Mycobacterial genomic DNA from strain H37Rv was a kind gift from A. Brown, L. J. Schreuder and T. Parish (Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK), and the extensively drug-resistant clinical M. tb strain C from P. Butcher and J. Dhillon (Institute of Infection and Immunity, St. George’s Hospital, University of London, UK) (Witney et al., 2015). To test target enrichment, M. tb DNA was mixed with human genomic DNA (Male, #G1471, Promega), to 10 % (450 ng human DNA, 50 ng M. tb DNA) or 90 % (450 ng M. tb, 50 ng human) prior to processing.

RNA from influenza virus strain A A/PR/8/34 H1N1, lot 1115 (1.78 × 10^13 TCID50 ml^-1), grown in the MDCK Cocker Spaniel kidney cell line, was obtained from the Public Health England Culture Collection (#0111041v, Porton Down, UK), and reverse transcribed with NEBNext RNA First Strand Synthesis Module #E7525 and NEBNext mRNA Second Strand Synthesis Module #E6111 (New England Biolabs) according to the manufacturer’s instructions. The ds cDNA was cleaned up with DNA Clean & Concentrator columns (#D4013, Zymo Research).

DNA from HCMV strain Merlin grown in fibroblast cell culture (6.75 × 10^6 copies ml^-1, determined by qPCR) was a kind gift of R. Milne at the Department for Virology, UCL Medical School, Royal Free Campus, London, UK.

Sample preparation and long-fragment hybridisation. The different workflows for this study are outlined in Fig. 1. HCMV (500 ng, 6.7 × 10^7 copies) and M. tb samples (500 ng) were diluted in TE to an end volume of 80 µl, and sheared in Covaris g-TUBEs (#520079, Covaris) with two passages at 7200 r.p.m./4200 g for 1 min in a desktop centrifuge (#5242, Eppendorf). The HCMV genomic DNA was subjected to PreCR (#M0309, New England Biolabs) enzymatic repair according to the manufacturer’s recommendations after shearing (Table 1). Influenza virus A samples were not sheared since the cDNA fragments were size-compatible with Nanopore sequencing. The equivalent of 1 × 10^12 TCID50 was used for the library preparation from the enriched influenza virus A cDNA.

Concentrations and fragment sizes were determined with a Qubit fluorometer (dsDNA BR Assay Kit #Q32850, Life

Microbial Genomics

[Figure 1: workflow diagram with panels (a), (b) and (c); the diagram itself is not reproduced here.]

Fig. 1. Workflow for hybridisation and sequencing of long-fragment-enriched pathogen DNA. (a) Enrichment and library preparation for the Oxford Nanopore MinION sequencer; (b) non-enriched Nanopore libraries; (c) library preparation for long-fragment-enriched Illumina control experiments.

Technologies), and Agilent Tape Station (Genomic DNA ScreenTape #5067–5365 and Genomic DNA Reagents #5067–5366; High Sensitivity RNA ScreenTape #5067–5579, High Sensitivity RNA ScreenTape Sample Buffer #5067–5580, High Sensitivity RNA ScreenTape Ladder #5067–5581; High Sensitivity D1000 ScreenTape #5067–5584, High Sensitivity D1000 Reagents #5067–5585, Agilent) according to the manufacturers’ instructions throughout the experiment.

Biotinylated custom RNA baits for the target organisms influenza virus A (49190 baits), HCMV (33809 baits) and M. tb (224612 baits) were designed with an in-house Perl script (Depledge et al., 2011), using a database of 4968 H1N1 and 2966 H3N2 influenza virus A genomes, 115 partial and complete HCMV genomes and the M. tb strain H37Rv reference genome (NC_018143.2), respectively, and manufactured by Agilent. Sheared genomic DNA (HCMV,

Table 1. Average size of DNA fragments at various stages during the protocol
The table shows the modal fragment size post-shear and post-PCR, and mean Nanopore read length (‘pass’ and ‘fail’ read quality) with standard deviations (SD), of the samples used in this study.

Sample | Modal fragment size post-shear (nt) | Modal fragment size post-PCR (nt) | Mean ‘pass’ read length [nt (SD)] | Mean ‘fail’ read length [nt (SD)]
Influenza virus A non-enriched* | 160, 320, 500, 670, 900, 1250, 3000+ | 99–4000 | 1598 (1191) | 805 (946)
Influenza virus A enriched* | 160, 320, 500, 670, 900, 1250, 3000+ | 370–4000 | 773 (683) | 533 (733)
HCMV non-enriched | 12 500† | –‡ | 3176 (2291) | 487 (1203)
HCMV enriched | 12 500 | 1587, 5640 | 1528 (975) | 1083 (1099)
M. tb H37Rv | 13 800 | 2000–7000 | 2402 (1865) | 757 (1855)
M. tb strain C | 15 000 | 1500 | 759 (355) | 596 (713)

*Influenza virus A samples were not sheared.
†HCMV fragment size after shearing and PreCR.
‡The non-enriched HCMV sample was not PCR-amplified as enough material was available to proceed directly to sequencing.
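The read-length columns of Table 1 are a mean and standard deviation over the reads in each fasta file. The following is a minimal sketch of that summary, not the authors' script: the function names are hypothetical, the three reads are made up, and it assumes the sample (n − 1) standard deviation, which the text does not specify.

```python
import statistics
from typing import Iterable, Iterator, Tuple


def fasta_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the sequence length of each record in FASTA-formatted lines."""
    length = None  # stays None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length  # emit the record that just ended
            length = 0
        elif line and length is not None:
            length += len(line)  # sequences may span several lines
    if length is not None:
        yield length  # emit the final record


def mean_and_sd(lengths: Iterable[int]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of the read lengths."""
    values = list(lengths)
    if not values:
        raise ValueError("no reads")
    # Assumption: sample SD; a single read has no spread, so report 0.0.
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    return statistics.mean(values), sd


# Toy input with three made-up reads (not data from this study).
toy = [">r1", "ACGT", ">r2", "ACGTACGT", ">r3", "ACGTAC", "GTACGT"]
mean_len, sd_len = mean_and_sd(fasta_lengths(toy))
```

In practice the lines would come from an open fasta file, run once for the ‘pass’ reads and once for the ‘fail’ reads.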

http://mgen.microbiologyresearch.org

M. tb) and cDNA (influenza virus A) samples were hybridised and captured according to the SureSelectXT Target Enrichment for Illumina Paired-End Multiplexed Sequencing protocol (Version B.1, 2014, 16 h hybridisation). Following capture, samples were heated to 95 °C for 3 min, and cooled to 35 °C (ramp: 4 °C min^-1) to release the target fragments from the baits bound to streptavidin beads.

Half of each hybridised sample was used for Nanopore library preparation with ONT kit versions SQK-MAP003 for M. tb strain H37Rv and SQK-MAP004 for HCMV, influenza virus A, and M. tb strain C. The remainder was used for the generation of Illumina-compatible libraries.

Nanopore library preparation, sequencing and analysis. End repair and dA-tailing of all samples were performed with enzymes from the SureSelectXT kit (#5500–0075, Agilent) and adaptors and primers from the ONT SQK-MAP003 or SQK-MAP004 library preparation kits as specified by the manufacturers. Following AMPure XP bead purification (#A63880, Beckman Coulter), the dA-tailed samples were ligated to adaptors (ONT) for 15 min at room temperature. They were cleaned up with AMPure XP beads and eluted in H2O. The ligated DNA was amplified using Long Amp Taq 2x Master mix (#M0287, NEB) and ONT PCR primers with the following program: 95 °C 3 min; 15–18 cycles of 95 °C 15 sec, 62 °C 15 sec, 65 °C 10 min; 65 °C 20 min; 4 °C hold.

A second round of end repair and dA-tailing was performed on 500 ng of enriched, amplified PCR product using SureSelectXT reagents as described above, but without purification after dA-tailing. Instead, leader/hairpin ligation and sample clean-up were performed according to the ONT protocols for kit SQK-MAP003 (used in the M. tb strain H37Rv experiments only) or SQK-MAP004. In detail, dA-tailed sample, blunt/TA ligase master mix (#M0367, NEB), tethered adapter mix and hairpin adapters (ONT) were incubated for 10 min at room temperature in protein LoBind tubes (#0030108116, Eppendorf) for ligation. Libraries processed according to the ONT SQK-MAP003 protocol were cleaned up with AMPure XP beads; those made according to the SQK-MAP004 method were purified using Dynabeads for His-Tag isolation and pulldown (#10103D, Life Technologies) (Fig. 1a). Libraries were eluted from the beads by incubation for 10 min at room temperature in elution buffer (ONT). Library concentrations were typically 2–10 ng µl^-1, as assessed by Qubit fluorometer.

The influenza virus A control sample that did not undergo hybridisation (75 ng, the equivalent of 2.7 × 10^11 TCID50) was end-repaired, dA-tailed and amplified with Long Amp Taq polymerase as described above. Samples (500 ng) of this PCR product were processed as recommended in the ONT Genomic DNA sequencing protocol SQK-MAP004. For the non-hybridised HCMV sample, 500 ng (4.2 × 10^7 copies) were used directly for Nanopore library preparation (SQK-MAP004) without amplification as enough material was available to proceed directly to sequencing (Fig. 1b).

Before each MinION run, flowcells were quality-tested with the script MAP_Platform_QC (MinKnow software version 0.46.2.8 to 0.49.2.9), then loaded with 12–60 ng of prepared library, library fuel mix and EP buffer (ONT) as per the manufacturer’s instructions, and run with script MAP_48Hr_Sequencing_Run, for an average of 26 h.

Reads were analysed by the Metrichor 2D basecalling (versions 2.19 to 2.29) cloud-based platform, and the resulting fast5 files (‘pass’ quality, both strands read while passing through the nanopore, resulting in higher confidence; and ‘fail’, where only one strand is read) converted to fasta format with Poretools (Loman & Quinlan, 2014). BLASR (Chaisson & Tesler, 2012) and LAST (Kiełbasa et al., 2011) were used to align reads to the pathogen reference sequences (HCMV herpesvirus HHV-5 GU179001.1, M. tb strain H37Rv NC_018143.2, and influenza virus A strain H1N1, A/Puerto Rico/8/1934). Command lines were: ‘./blasr input.fa reference.fa -sam -out output.sam’, ‘samtools view -bS output.sam > output.bam’ for BLASR and ‘lastdb index_input input.fasta’, ‘lastal index_input reference.fa -r1 -a1 -b1 > output.maf’, ‘maf-convert -n sam < output.maf > output.sam’ for LAST, respectively. Files were further tested with both aligners against background human (Human_g1k_v37, www.1000genomes.org) or dog (Ensembl CanFam3.1 GCA_000002285.2; NC_006583.3) sequences and the ONT adapters used for PCR.

Illumina library preparation from long, hybridisation-enriched fragments. For the generation of Illumina libraries, half of each hybridised sample (M. tb strain H37Rv, M. tb strain C, HCMV and influenza virus A) were sheared with a Covaris AFA instrument (Covaris) to 200 nt fragment size and converted into Illumina-compatible libraries (Fig. 1c) using Agilent reagents and SureSelectXT protocol steps as before. Briefly, samples were end-repaired, dA-tailed, had adapters ligated and were PCR-amplified (six cycles) as described in the protocol. Following sample purification, the PCR products were re-amplified using post-capture indexed PCR2 primers for a further 15 cycles. Sequencing (2 × 300 nt read length) was performed on an Illumina MiSeq instrument with paired-end 600V3 kits (#MS-102-3003) with automatic adapter trimming. Results from the Illumina MiSeq runs were aligned to the respective references with Bowtie version 1.1.1 (http://bowtie-bio.sourceforge.net/index.shtml). Additional alignment metrics from the bam files were obtained using the Picard CollectMultipleMetrics (http://broadinstitute.github.io/picard/) tool, which generates metrics such as percentage of reads aligned to a given target as well as coverage data.

Results

Comparison of Nanopore library size and read length

Table 1 shows the peak sizes of the DNA samples after shearing, as determined on an Agilent Tape Station. The size distribution of the influenza virus A RNA and cDNA


prior to processing, showed distinct peaks at 160 nt, 320 nt, 500 nt, 670 nt, 900 nt, 1.2 kb, 3 kb, (Fig. S1a, b, available in the online Supplementary Material), with fragments up to 15 kb. These were presumably short fragments of the eight influenza virus A segments NC_002016 to NC_002023, and residual dog cell line DNA. The size of fragments pre- and post-reverse transcription were broadly similar (Fig. S1). Due to the shortness of the fragments, influenza virus A samples were not sheared.

The HCMV sample (g-TUBE-sheared and PreCR-treated) had a tight range of fragment sizes of around 12.8 kb. After PCR amplification, a broad range of fragment sizes both within and between individual reactions were observed. In general, the products were about half the size of the original DNA before hybridisation, ranging between 1.6 kb and 5.6 kb. One exception was strain M. tb C, which had shorter (median size 1.5 kb) PCR products.

The Nanopore reads (Table 1) were similarly variable in length, reflecting the input material, as indicated by the standard deviations in Table 1. Sequenced reads were shorter on average than the PCR products, but with a wide range. Reads classified as ‘pass’ quality by the Metrichor platform were longer than ‘fail’ quality reads. Non-hybridised samples had longer read lengths than enriched samples, either due to DNA damage during the hybridisation and wash processes, or preferential amplification of shorter fragments during PCR.

Comparison of BLASR and LAST aligners

We used BLASR (Chaisson & Tesler, 2012) and LAST (Kiełbasa et al., 2011), with the settings used in Quick et al. (2014) for the alignment of Nanopore reads to their respective references (pathogen and human/dog cell line). Table 2 shows statistics for the similarities to the target references obtained with the two aligners. We found that BLASR alignment of reads showed slightly higher identity to the references, shorter aligned regions and lower standard deviation. The LAST aligner produced longer alignments with lower identity and higher standard deviation. This is similar to the observations of Kilianski et al. (2015). A percentage of reads (10–35 %) aligned to the reference by LAST are not aligned by BLASR, and vice versa, indicating that neither aligner works optimally for aligning Nanopore reads to the reference.

Comparison of enriched and non-enriched Nanopore libraries

A total of 13 nanopore sequencing runs were included in our datasets. The average starting pore count per flowcell was 215. Most ‘pass’ quality reads aligned to either the target organism or the respective cell line, whereas most ‘fail’ quality reads did not match to target, cell line (Table 3) or sequences in the PubMed Nucleotide database (November 2015). This has been reported elsewhere (e.g. Greninger et al., 2015; Kilianski et al., 2015). Regions of alignment were shorter than read length, possibly due to regional increase of the error rates within reads.

Analysis of the 42 261 reads obtained from one non-enriched, PCR-amplified influenza virus A cDNA library run on the Nanopore MinION™ found 98.9 % ‘pass’ and 25.1 % ‘fail’ reads aligned to the MDCK dog cell line used for cultivation of the virus, whilst only one read aligned to the influenza virus A reference H1N1. After hybridisation and amplification, 57.2 % of ‘pass’ and 9.5 % of ‘fail’ reads (34 211 reads in total) from one Nanopore run could be aligned to influenza virus A. This amounts to an average read depth of the influenza virus A genome of 62.9×. Fig. 2 shows uneven distribution of reads per fragment, with distinct peaks of increased coverage. This probably reflects the size distribution of the input RNA (Fig. S1a) rather than effects of reverse transcription, hybridisation or PCR bias. The frequency of cell line reads in influenza virus A-enriched samples dropped to 28.4 % (‘pass’) and 2.9 % (‘fail’) (Table 3).

The unenriched HCMV library (a total of 432 reads from one flowcell) produced four reads (0.2 % of total) matching the HCMV reference HHV-5, while 47 reads (10.9 % of total) matched the human_g1k_v37 reference. After enrichment of

Table 2. Mean similarity and length (with standard deviations, SD) of Nanopore reads aligned to the pathogen targets using BLASR and LAST

Sample | BLASR: mean similarity of reads to target [% (SD)] | BLASR: mean length of alignment [nt (SD)] | LAST: mean similarity of reads to target [% (SD)] | LAST: mean length of alignment [nt (SD)]
Influenza virus A enriched | 79 (6.4) | 201 (121) | 74.8 (6.8) | 314 (136)
HCMV enriched | 76.8 (6.8) | 946 (909) | 72.1 (7.9) | 1413 (986)
M. tb H37Rv enriched | 76.9 (6.5) | 949 (1022) | 69.2 (7.6) | 1667 (1154)
M. tb strain C enriched | 79.4 (6.7) | 287 (241) | 70.7 (7.6) | 523 (744)
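The text does not state how the similarity and alignment-length values in Table 2 were derived from the aligner output. One common definition, sketched below as an assumption rather than the authors' procedure, takes the number of alignment columns from the SAM CIGAR string and subtracts the edit distance reported in the optional NM tag. The function name is hypothetical, and the sketch assumes the record actually carries an NM tag.

```python
import re
from typing import Tuple

# One CIGAR element: a count followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def identity_from_sam(cigar: str, nm: int) -> Tuple[float, int]:
    """Return (percent identity, alignment length in columns).

    Alignment columns are the M, =, X, I and D operations; soft and hard
    clips (S, H), skips (N) and padding (P) are ignored. NM counts
    mismatches plus inserted and deleted bases, so matches = columns - NM.
    """
    columns = 0
    for count, op in CIGAR_OP.findall(cigar):
        if op in "M=XID":
            columns += int(count)
    if columns == 0:
        raise ValueError("CIGAR has no aligned columns")
    return 100.0 * (columns - nm) / columns, columns


# Made-up record: 90 aligned bases, 5 inserted, 5 deleted, 25 edits in total.
example = identity_from_sam("5S90M5I5D10S", 25)
```

Other definitions (for example, identity over matched bases only) give different values, so this should be read as one possible convention.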


Table 3. Percentages of Nanopore reads aligned to target pathogen and cell line/human DNA in the samples prepared for this study
Alignment statistics are the combined results from BLASR and LAST.

Sample | Target pathogen: % of total reads | Target pathogen: % of ‘pass’ reads | Target pathogen: % of ‘fail’ reads | Cell line/human DNA: % of total reads | Cell line/human DNA: % of ‘pass’ reads | Cell line/human DNA: % of ‘fail’ reads
Influenza virus A non-enriched | 0.0 | 0.0 | 0.0 | 29.9 | 98.9 | 25.1
Influenza virus A enriched | 10.9 | 57.2 | 9.5 | 3.7 | 28.4 | 2.9
HCMV non-enriched | 0.2 | 0.0 | 1.0 | 10.9 | 100.0 | 7.2
HCMV enriched | 45.5 | 98.7 | 35.0 | 1.1 | 1.3 | 1.0
10 % M. tb H37Rv enriched | 7.3 | 32.8 | 3.9 | 11.4 | 46.6 | 6.6
90 % M. tb H37Rv enriched | 3.4 | 5.9 | 1.7 | 10.5 | 12.6 | 9.2
10 % M. tb strain C enriched | 0.8 | 5.9 | 0.8 | 8.2 | 23.5 | 8.2
90 % M. tb strain C enriched | 4.4 | 88.1 | 3.9 | 5.2 | 10.3 | 5.2
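Table 3 reports combined results from BLASR and LAST, but the text does not say how the two sets of alignments were merged. The sketch below assumes the simplest reading, a union of read identifiers, and also reports the share of reads found by only one aligner (the kind of disagreement quoted as 10–35 % in the text). The function name and read identifiers are hypothetical.

```python
from typing import Dict, Set


def combine_aligners(blasr_ids: Set[str], last_ids: Set[str],
                     total_reads: int) -> Dict[str, float]:
    """Summarise reads aligned by either aligner (union) and by only one."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    union = blasr_ids | last_ids  # assumption: 'combined' means union
    return {
        "percent_aligned": 100.0 * len(union) / total_reads,
        # Share of LAST-aligned reads that BLASR did not align, and vice versa.
        "percent_last_only": (100.0 * len(last_ids - blasr_ids) / len(last_ids)
                              if last_ids else 0.0),
        "percent_blasr_only": (100.0 * len(blasr_ids - last_ids) / len(blasr_ids)
                               if blasr_ids else 0.0),
    }


# Made-up read identifiers for a run of 20 reads.
summary = combine_aligners({"r1", "r2", "r3", "r4"},
                           {"r3", "r4", "r5", "r6", "r7"},
                           total_reads=20)
```

Running the same function separately on ‘pass’ and ‘fail’ reads would give the per-quality columns.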

the DNA with the HCMV-specific bait set, we obtained 37 589 reads from three runs, with almost all (98.7 %) ‘pass’ reads and 35 % of ‘fail’ reads aligning to the HCMV reference (Table 3). This amounts to an average read depth of 87.6× of the HCMV genome. Panels a in Fig. 3 show the coverage of all Nanopore reads aligned to the reference.

A comparison of the consensus sequence generated from the enriched HCMV reads aligned to the HCMV HHV-5 reference using the genomic similarity search tool YASS (Noé & Kucherov, 2005) revealed that the former had 99.4 % similarity to the reference (233 854 of 235 230 nucleotide residues). The conflicting/mismatch residues are mostly gaps in the Nanopore consensus sequence at positions 46 364–46 433 (proteins UL34 and UL35), 147 820–147 830 (helicase–primase subunit UL102), 194 363–194 698 and 195 851–195 977. The last two regions of difference coincide with inverted repeat regions (194 344–195 667, 195 090–197 626) (Masse et al., 1992). A number of mismatches to the reference HHV-5 were identified upstream of base 1270; these were due to low coverage of this region by Nanopore reads. We found regions with low (<5×) coverage had a high number of mismatches compared with the reference, but areas of greater coverage matched near-perfectly.

For M. tb strain H37Rv, we obtained a total of 2028 ‘pass’ and 9961 ‘fail’ reads (equivalent to 0.077× coverage) from four flowcells, for the strain M. tb C, 202 ‘pass’ and 46 711 ‘fail’ reads (0.182×) were obtained from three flowcells. Localized areas with high coverage were found in both strains; these were found to correspond to open-reading frames encoding transposases LH57_07500, LH57_18955, LH57_18175, and LH57_04320, at positions 887 429–887 488, 889 044–890 363, 1 538 580–1 539 822, 2 635 594–2 640 242, 3 544 391–3 547 252 and 3 788 312–3 789 669 of strain H37Rv (Fig. 4). These regions have been highly enriched compared with the background of human DNA, and also compared with the rest of the M. tb genome. The sample originally containing 10 % H37Rv DNA showed the highest rate of reads aligning to the reference, while both 90 % M. tb DNA samples (H37Rv and strain C) show enrichment mainly in the transposase regions.

Sequencing of enriched long fragments on the Illumina MiSeq

To assess the success of the long fragment hybridisation, Illumina libraries were generated from the remaining half of the hybridised material, and sequenced on a MiSeq instrument (results shown in Table 4). A high percentage of influenza virus A and HCMV reads from long enriched fragments aligned to the target reference in both Illumina and Nanopore ‘pass’ reads.

Illumina-generated reads showed higher percentages of alignment than Nanopore reads, presumably due to the lower error rates. Illumina libraries generated from the hybridisation of long fragments, particularly the independently generated, 10 % M. tb H37Rv libraries 1–4 in Table 4, show successful enrichment of mycobacterial DNA, with 56–96 % of reads aligning to the H37Rv genome. Results for M. tb strain C show a relatively low rate of alignment of reads to the H37Rv genome in both Nanopore and Illumina experiments. This could be due to less successful enrichment, and an imperfect match of the M. tb strain C reads aligned to strain H37Rv, which has 98.9 % identity to


Table 4. Illumina MiSeq runs of independent hybridisations of long fragments
The table shows the statistics of alignment generated by Bowtie (http://bowtie-bio.sourceforge.net/index.shtml) and coverage generated by Picard (http://broadinstitute.github.io/picard/).

Sample | Fragment hybridisation | Number of reads aligned to target pathogen | Percentage of reads aligned to target pathogen | Mean depth of pathogen coverage | Percentage of target bases covered at 10×
Influenza virus A | Long | 2 664 967 | 59 | 2089 | 97
HCMV | Long | 1 765 332 | 94 | 957 | 96
10 % M. tb H37Rv (1) | Long | 6 906 339 | 84 | 295 | 99
10 % M. tb H37Rv (2) | Long | 1 071 332 | 62 | 50 | 99
10 % M. tb H37Rv (3) | Long | 8 193 188 | 96 | 315 | 99
10 % M. tb H37Rv (4) | Long | 2 942 898 | 56 | 100 | 98
90 % M. tb H37Rv (5) | Long | 2 258 868 | 93 | 99 | 99
90 % M. tb H37Rv (6) | Long | 6 978 112 | 97 | 297 | 99
90 % M. tb H37Rv (7) | Long | 9 926 437 | 87 | 342 | 99
90 % M. tb H37Rv (8) | Long | 15 382 452 | 96 | 521 | 98
M. tb H37Rv (9) | Short | 3 982 148 | 99 | 169 | 99
90 % M. tb strain C (1) | Long | 689 141 | 18 | 24 | 85
M. tb strain C (2) | Short | 2 980 023 | 99 | 115 | 97
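The last two columns of Table 4 were produced with Picard. As an illustration of what those two quantities mean, and not of how Picard computes them, the sketch below derives a mean depth and the percentage of positions covered at 10× or more from a list of per-base depths. The function name and the depth values are hypothetical; such per-base depths could come, for instance, from a tool like samtools depth.

```python
from typing import Sequence, Tuple


def depth_summary(depths: Sequence[int], min_depth: int = 10) -> Tuple[float, float]:
    """Return (mean depth, percent of positions with depth >= min_depth)."""
    if not depths:
        raise ValueError("no positions")
    mean_depth = sum(depths) / len(depths)
    covered = sum(1 for d in depths if d >= min_depth)
    return mean_depth, 100.0 * covered / len(depths)


# Made-up depths for a 10-base target (not data from this study).
toy_depths = [0, 5, 10, 20, 40, 12, 9, 30, 11, 3]
mean_depth, pct_at_10x = depth_summary(toy_depths)
```

A high mean depth with a lower percentage at 10× would indicate uneven coverage, as discussed for the transposase regions.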

a consensus sequence generated from our Illumina-sequenced M. tb strain C.

Results from the enriched influenza virus A (Fig. S1c) show concordance with the coverage by Nanopore results (Fig. 2). The unevenness of the coverage is presumably a result of the prevalence of short fragments in the original RNA sample (Fig. S1a), reverse-transcribed to cDNA (Fig. S1b). Illumina reads (Fig. 3b) generally show less even coverage of the HCMV genome compared with Nanopore reads (Fig. 3a). Fifteen (out of a set of 23 525) aligned Nanopore reads span the repetitive replication origin oriLyt at position 94 488–94 588 (Chen et al., 1999) (Fig. S2c, d). The complete (3.5 M aligned reads) Illumina dataset (Table 4) has a 100 bp gap in the alignment at this repetitive position (Fig. S2a, b). Two Nanopore reads cover the inverted repeat region 194 293–195 565, while no Illumina reads aligned in this gap, and almost all Illumina reads in the adjacent 2.5 kb region show a mapping quality equal to zero, when visualised in the IGV (Fig. S2g, h). A similar outcome has been observed for a comparison of Nanopore and 454 reads for human herpesvirus type 1 (Karamitros et al., 2016). Areas with increased coverage can also be observed in Nanopore- and Illumina-generated datasets (Fig. 4) in M. tb. Here, this is presumably due to the redundancy of transposase-encoding sequences, which could result in localised increased aligning of reads.

Discussion

This study explores the capture of specific, long DNA fragments for sequencing on a long-read platform, the Oxford Nanopore MinION instrument. We demonstrate that our method can be used to enrich large, specific regions of interest in mixed samples. Previous work by Greninger et al. (2015) has shown that detection of moderate to high titres of pathogen DNA (chikungunya virus, Ebola and hepatitis C virus) from human blood samples is possible using Nanopore sequencing. However, this direct sequencing approach is inefficient if the region of interest is a small subset of the total DNA, the target is of low titre, or if high coverage is required for strain typing and variant identification. In our Nanopore sequencing experiments with un-enriched influenza virus A and HCMV DNA (from cell cultures), we detected very low numbers of reads from the pathogen compared with those from the host cell line. In contrast, sequencing data from enriched DNA produced good coverage of the influenza virus A and HCMV genomes and partial coverage of the M. tb genome. Control experiments using Illumina sequencing to assess the quality of enrichment (Table 4, Fig. 4) showed good overall and minimum coverage, similar to the sample enriched by short-fragment hybridisation (Brown et al., 2015; Christiansen et al., 2014, sample 9 and M. tb strain C sample 2 in Table 4), indicating that the enrichment of long fragments does not introduce bias. Preferential enrichment of certain regions (Fig. 4) seems to be due to redundancy of the captured sequence, in this case the transposases.

The drawbacks of our method, compared with the high-throughput protocol used by Brown et al. (2015), and Christiansen et al. (2014), were lower target coverage and throughput. Enrichment and library preparation take approximately 28 h and include a 16 h hybridisation step and 3–4 h of long-range PCR. In the future, the enrichment step could be shortened to 4 h by using a different hybridisation protocol, and PCR amplification could be replaced with whole-genome amplification. Addition of molecular barcodes would allow pooling of several samples to be run


Fig. 2. Coverage profile of Nanopore reads from enriched influenza virus A cDNA, aligned to reference H1N1 with BLASR; coverage visualised in the Integrative Genomics Viewer (IGV; Robinson et al., 2011; Thorvaldsdóttir et al., 2013). Panels: NC_002023.1 (PB2); NC_002021.1 (PB1, PB1-F2); NC_002022.1 (PA); NC_002017.1 (HA); NC_002019.1 (NP); NC_002018.1 (NA); NC_002016.1 (M1, M2); NC_002020.1 (NS1, NEP). Maximum read depths for the fragments according to IGV are: 139 (NC_002023.1), 139 (NC_002021.1), 225 (NC_002022.1), 51 (NC_002017.1), 219 (NC_002019.1), 1589 (NC_002018.1), 185 (NC_002016.1), 16 (NC_002020.1).


Fig. 3. (a) Nanopore reads of HCMV, aligned with BLASR to strain HHV-5; coverage visualised with IGV. Maximum read depth 239. (b) An Illumina run generated from a long-fragment enrichment, downsampled to a similar coverage of 200–300.



Fig. 4. Coverage by Nanopore reads from DNA of M. tb H37Rv (a) and M. tb strain C (b), and by Illumina reads from long-fragment-enriched M. tb H37Rv (c) and M. tb strain C (d). A region (886 000–893 000) shows high coverage in both the Nanopore and the Illumina experiments. Nanopore reads were aligned with BLASR and Illumina reads with Bowtie; coverage was visualised in IGV. Maximum read depth, as determined by IGV, is 17 (a), 22 (b), 331 (c) and 277 (d).
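
The maximum read depths quoted in the figure legends were read off IGV. As a rough illustration of what such a figure represents, the following minimal Python sketch computes per-base depth from alignment intervals and reports the maximum depth together with any runs at or above a chosen threshold. The function names, the toy intervals and the threshold are hypothetical and are not taken from the study; in practice the intervals would be derived from BLASR or Bowtie alignments.

```python
def depth_profile(alignments, ref_length):
    """Per-base read depth for a reference of ref_length bases.

    alignments: iterable of (start, end) intervals, 0-based, end-exclusive.
    (Hypothetical input format chosen for this sketch.)
    """
    # Difference array: +1 where a read starts, -1 where it ends.
    diff = [0] * (ref_length + 1)
    for start, end in alignments:
        start = max(0, start)
        end = min(ref_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    depth = []
    running = 0
    for delta in diff[:ref_length]:
        running += delta
        depth.append(running)
    return depth


def high_coverage_regions(depth, threshold):
    """Return (start, end) runs, end-exclusive, where depth >= threshold."""
    regions = []
    run_start = None
    for pos, d in enumerate(depth):
        if d >= threshold and run_start is None:
            run_start = pos
        elif d < threshold and run_start is not None:
            regions.append((run_start, pos))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(depth)))
    return regions


# Toy data only: five reads on a 20-base reference.
toy_alignments = [(0, 10), (5, 15), (5, 12), (6, 9), (18, 20)]
toy_depth = depth_profile(toy_alignments, 20)
max_depth = max(toy_depth)
peaks = high_coverage_regions(toy_depth, threshold=3)
print(max_depth, peaks)
```

A local peak of this kind, such as the 886 000–893 000 region in M. tb, can arise when reads from several copies of a repeated sequence align to one location, so a high-depth run on its own does not show that the region was preferentially captured.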

simultaneously on one MinION flowcell. This, coupled with increasing speed, accuracy and throughput of MinION reads (e.g. results in Norris et al., 2016), will reduce the time and number of reads necessary for strain and variant identification, making this method suitable for diagnostic purposes. The relatively inexpensive and small-footprint MinION sequencers have been used in settings where conventional Illumina sequencing would be difficult (Quick et al., 2016).

We see the main application of our method of enriching long fragments in the detection of structural variants and in generating comprehensive coverage of specific target regions by long-read sequencing. Nanopore sequencing has previously been used to detect structural variants in pathogenic bacteria (Ashton et al., 2015), human DNA samples (Ammar et al., 2015) or human cancer cell lines (Norris et al., 2016); we believe our method could be employed as a non-amplicon-based alternative for this application, improving library complexity and uniformity of the sample, and aiding the detection of single-nucleotide variants (Samorodnitsky et al., 2015). As the enrichment approach is platform-agnostic, it could also be used to generate libraries compatible with the other long-read sequencers, benefitting the field of research into structural variation.

Acknowledgements
We would like to thank Dietrich Lueersen, David Blaney and Dan Swan for their help in the analysis of the data; Richard Milne at the Department for Virology, UCL Medical School (Royal Free Campus, Rowland Hill Street, London, UK) for the kind gift of the CMV Merlin strain; Amanda Brown, Lise J. Schreuder and Tanya Parish at Barts and The London School of Medicine and Dentistry (Queen Mary University of London, London, UK) for strain M. tb H37Rv; and Philip Butcher and Jasvir Dhillon at St. George’s Hospital (University of London, London, UK) for the generous donation of strain M. tb C.
Past and present members of the PATHSEEK consortium are: Judith Breuer, Rachel Williams, Mette Theilgaard Christiansen, Josie Bryant, Sofia Morfopoulou, Helena Tutill, Erika Yara-Romero, Charlotte Williams and Dan Depledge (UCL); Martin Schutten, Saskia Smits, Georges M.G.M. Verjans, Freek B. van Loenen, Anne van der Linden and Albert Osterhaus (Erasmus MC); Katja Einer-Jensen, Martin Ludvigsen and Roald Forsberg (CLC Bio); James Clough, Graham Speight, Jacqueline Chan, Jolyon Holdstock, Sabine E. Eckert, Mike McAndrew and Amanda Brown (OGT).
Sabine E. Eckert is a Nanopore shareholder.

References
Ammar, R., Paton, T. A., Torti, D., Shlien, A. & Bader, G. D. (2015). Long read nanopore sequencing for detection of HLA and CYP2D6 variants and haplotypes. F1000Res 4, 17.
Ashton, P. M., Nair, S., Dallman, T., Rubino, S., Rabsch, W., Mwaigwisya, S., Wain, J. & O’Grady, J. (2015). MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island. Nat Biotechnol 33, 296–300.
Brown, A. C., Bryant, J. M., Einer-Jensen, K., Holdstock, J., Houniet, D. T., Chan, J. Z., Depledge, D. P., Nikolayevskyy, V., Broda, A. & other authors (2015). Rapid whole-genome sequencing of Mycobacterium tuberculosis isolates directly from clinical samples. J Clin Microbiol 53, 2230–2237.
Carlet, J. (2015). The world alliance against antibiotic resistance: consensus for a declaration. Clin Infect Dis 60, 1837–1841.
Chaisson, M. J. & Tesler, G. (2012). Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238.
Chen, Z., Sugano, S. & Watanabe, S. (1999). A 189-bp repeat region within the human cytomegalovirus replication origin contains a sequence dispensable but irreplaceable with other sequences. Virology 258, 240–248.
Christiansen, M. T., Brown, A. C., Kundu, S., Tutill, H. J., Williams, R., Brown, J. R., Holdstock, J., Holland, M. J., Stevenson, S. & other authors (2014). Whole-genome enrichment and sequencing of Chlamydia trachomatis directly from clinical samples. BMC Infect Dis 14, 591.
Depledge, D. P., Palser, A. L., Watson, S. J., Lai, I. Y., Gray, E. R., Grant, P., Kanda, R. K., Leproust, E., Kellam, P. & Breuer, J. (2011). Specific capture and whole-genome sequencing of viruses from clinical samples. PLoS One 6, e27805.
Doughty, E. L., Sergeant, M. J., Adetifa, I., Antonio, M. & Pallen, M. J. (2014). Culture-independent detection and characterisation of Mycobacterium tuberculosis and M. africanum in sputum samples using shotgun metagenomics on a benchtop sequencer. PeerJ 2, e585.
Greninger, A. L., Naccache, S. N., Federman, S., Yu, G., Mbala, P., Bres, V., Stryke, D., Bouquet, J., Somasekar, S. & other authors (2015). Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis. Genome Med 7, 99.


Jiang, J., Gu, J., Zhang, L., Zhang, C., Deng, X., Dou, T., Zhao, G. & Zhou, Y. (2015). Comparing Mycobacterium tuberculosis genomes using genome topology networks. BMC Genomics 16, 85.
Joseph, S. J. & Read, T. D. (2012). Genome-wide recombination in Chlamydia trachomatis. Nat Genet 44, 364–366.
Karamitros, T. & Magiorkinis, G. (2015). A novel method for the multiplexed target enrichment of MinION next generation sequencing libraries using PCR-generated baits. Nucleic Acids Res 43, e152.
Karamitros, T., Harrison, I., Piorkowska, R., Katzourakis, A., Magiorkinis, G. & Mbisa, J. L. (2016). De novo assembly of human herpes virus type 1 (HHV-1) genome, mining of non-canonical structures and detection of novel drug-resistance mutations using short- and long-read next generation sequencing technologies. PLoS One 11, e0157600.
Kiełbasa, S. M., Wan, R., Sato, K., Horton, P. & Frith, M. C. (2011). Adaptive seeds tame genomic sequence comparison. Genome Res 21, 487–493.
Kilianski, A., Haas, J. L., Corriveau, E. J., Liem, A. T., Willis, K. L., Kadavy, D. R., Rosenzweig, C. N. & Minot, S. S. (2015). Bacterial and viral identification and differentiation by amplicon sequencing on the MinION nanopore sequencer. Gigascience 4, 12.
Loman, N. J. & Quinlan, A. R. (2014). Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics 30, 3399–3401.
Loman, N. J., Constantinidou, C., Christner, M., Rohde, H., Chan, J. Z., Quick, J., Weir, J. C., Quince, C., Smith, G. P. & other authors (2013). A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4. JAMA 309, 1502–1510.
Masse, M. J., Karlin, S., Schachtel, G. A. & Mocarski, E. S. (1992). Human cytomegalovirus origin of DNA replication (oriLyt) resides within a highly complex repetitive region. Proc Natl Acad Sci U S A 89, 5246–5250.
Norris, A. L., Workman, R. E., Fan, Y., Eshleman, J. R. & Timp, W. (2016). Nanopore sequencing detects structural variants in cancer. Cancer Biol Ther 17, 246–253.
Noé, L. & Kucherov, G. (2005). YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Res 33, W540–543.
Quick, J., Loman, N. J., Duraffour, S., Simpson, J. T., Severi, E., Cowley, L., Bore, J. A., Koundouno, R., Dudas, G. & other authors (2016). Real-time, portable genome sequencing for Ebola surveillance. Nature 530, 228–232.
Quick, J., Quinlan, A. R. & Loman, N. J. (2014). A reference bacterial genome dataset generated on the MinION portable single-molecule nanopore sequencer. Gigascience 3, 22.
Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G. & Mesirov, J. P. (2011). Integrative genomics viewer. Nat Biotechnol 29, 24–26.
Samorodnitsky, E., Jewell, B. M., Hagopian, R., Miya, J., Wing, M. R., Lyon, E., Damodaran, S., Bhatt, D., Reeser, J. W. & other authors (2015). Evaluation of hybridization capture versus amplicon-based methods for whole-exome sequencing. Hum Mutat 36, 903–914.
Thorvaldsdóttir, H., Robinson, J. T. & Mesirov, J. P. (2013). Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14, 178–192.
Witney, A. A., Gould, K. A., Arnold, A., Coleman, D., Delgado, R., Dhillon, J., Pond, M. J., Pope, C. F., Planche, T. D. & other authors (2015). Clinical application of whole-genome sequencing to inform treatment for multidrug-resistant tuberculosis cases. J Clin Microbiol 53, 1473–1483.
Wlodarska, M., Johnston, J. C., Gardy, J. L. & Tang, P. (2015). A microbiological revolution meets an ancient disease: improving the management of tuberculosis with genomics. Clin Microbiol Rev 28, 523–539.

Data Bibliography
The following reference sequences were used:
1. Human CMV herpesvirus: HHV-5 GU179001.1, http://www.ncbi.nlm.nih.gov/nuccore/GU179001.1
2. M. tb: strain H37Rv NC_018143.2, http://www.ncbi.nlm.nih.gov/nuccore/NC_018143.2
3. Influenza virus: strain H1N1, A/Puerto Rico/8/1934, http://www.ncbi.nlm.nih.gov/nuccore/8486138
4. Human: Human_g1k_v37, www.1000genomes.org
5. Dog: CanFam3.1 GCA_000002285.2, NC_006583.3, http://www.ncbi.nlm.nih.gov/nuccore/NC_006583.3
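
For readers who wish to retrieve the NCBI-hosted reference sequences programmatically, the following minimal Python sketch builds download URLs for the NCBI E-utilities `efetch` endpoint. This is an assumption about how the sequences could be fetched, not a description of how they were obtained in this study. The accessions are those listed above and in the Fig. 2 legend; the dictionary and function names are hypothetical. The sketch only constructs the URLs and does not contact NCBI.

```python
from urllib.parse import urlencode

# NCBI E-utilities efetch endpoint (assumed retrieval route, not the authors' method).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accessions as given in the Data Bibliography and the Fig. 2 legend.
REFERENCE_ACCESSIONS = {
    "HCMV (HHV-5)": ["GU179001.1"],
    "M. tb H37Rv": ["NC_018143.2"],
    "Influenza A H1N1 segments": [
        "NC_002023.1", "NC_002021.1", "NC_002022.1", "NC_002017.1",
        "NC_002019.1", "NC_002018.1", "NC_002016.1", "NC_002020.1",
    ],
    "Dog (CanFam3.1)": ["NC_006583.3"],
}


def efetch_fasta_url(accession):
    """Build an efetch URL requesting one nucleotide record as FASTA text."""
    query = urlencode({
        "db": "nuccore",      # nucleotide database
        "id": accession,      # accession.version
        "rettype": "fasta",   # FASTA output
        "retmode": "text",    # plain text rather than XML
    })
    return f"{EFETCH_BASE}?{query}"


# Map each organism label to the list of URLs for its accessions.
urls = {
    label: [efetch_fasta_url(acc) for acc in accessions]
    for label, accessions in REFERENCE_ACCESSIONS.items()
}

for label, links in urls.items():
    print(label, len(links), "record(s)")
```

Each URL can then be passed to any HTTP client; the human reference (Human_g1k_v37) is distributed by the 1000 Genomes Project rather than through this endpoint and is therefore left out.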
