bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Highly parallel direct RNA sequencing on an array of nanopores
Daniel R. Garalde1, Elizabeth A. Snell1, Daniel Jachimowicz1, Andrew J. Heron1, Mark Bruce1, Joseph
Lloyd1, Anthony Warland1, Nadia Pantic1, Tigist Admassu1, Jonah Ciccone1, Sabrina Serra1, Jemma
Keenan1, Samuel Martin1, Luke McNeill1, Jayne Wallace1, Lakmal Jayasinghe1, Chris Wright1, Javier
Blasco1, Botond Sipos1, Stephen Young1, Sissel Juul2, James Clarke1 & Daniel J Turner1*
* Corresponding author: Daniel J. Turner,
[email protected]
Author affiliations:
1
Oxford Nanopore Technologies Ltd., Oxford, UK
2
Oxford Nanopore Technologies Inc., New York, USA
Abstract
Ribonucleic acid sequencing can allow us to monitor the RNAs present in a sample. This enables us
to detect the presence and nucleotide sequence of viruses, or to build a picture of how active
transcriptional processes are changing – information that is useful for understanding the status and
function of a sample. Oxford Nanopore Technologies’ sequencing technology is capable of
electronically analysing a sample’s DNA directly, and in real-time. In this manuscript we demonstrate
the ability of an array of nanopores to sequence RNA directly, and we apply it to a range of biological
situations. Nanopore technology is the only available sequencing technology that can sequence RNA
directly, rather than depending on reverse transcription and PCR. There are several potential
advantages of this approach over other RNA-seq strategies, including the absence of amplification
and reverse transcription biases, the ability to detect nucleotide analogues and the ability to
generate full-length, strand-specific RNA sequences. Direct RNA sequencing is a completely new way
of analysing the sequence of RNA samples and it will improve the ease and speed of RNA analysis,
while yielding richer biological information.
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Introduction
The overall aims of transcriptomics include comprehensive sequencing of all species of transcript in a
sample, determining the transcriptional structure of genes including splice variants and fusion genes,
quantifying changing expression levels of transcripts in different developmental stages and in
different environments, and ascertaining the level of antisense transcription occurring in an
organism’s cells 1.
To fully satisfy these aims, a method of detecting and characterizing RNA is required that is
accurate, strand-specific, quantitative across a wide dynamic range, does not need prior knowledge
of the genome sequence or the transcripts that are being assayed, is capable of revealing the
presence and identity of modified bases, and can detect antisense transcripts without any possibility
that these are library preparation artefacts 2. It would also be advantageous to be able to generate
sequence reads that are sufficiently long to span multiple splice junctions, for unambiguous
mapping. There is no commercially-available technology which meets all of these requirements.
Transcriptomic analyses have been made possible by the massive throughput of nextgeneration sequencing-by-synthesis platforms: sequencing of complementary DNA (termed RNAseq) has enabled us to build an accurate picture of the active transcriptional patterns within
organisms 1. A single run on such a sequencer can generate millions of individual short reads,
providing a large dynamic range with no upper limit for quantification, potentially allowing precise
quantification of all transcripts without any prior knowledge of their sequence, or even their
existence.
The most commonly used strategy for next generation RNA-seq involves either
polydeoxythymine (poly dT) priming or fragmentation of ribonucleic acid (RNA) followed by
complementary DNA (cDNA) synthesis with random hexamer priming. These cDNA strands are then
typically prepared for analysis on a given sequencing system using a library preparation method that
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
incorporates PCR. However, wherever PCR is used in a library preparation, bias is introduced 3. For
example, the resulting library will have reduced complexity compared to the total mRNA pool,
because not all transcripts will amplify with the same efficiency, causing drop-out of some RNA
species, and excessive amplification of other species. Such PCR duplicates are difficult to distinguish
from genuinely abundant RNA species. Additionally, PCR amplification creates a synthetic copy of
the original RNA strand, and any epigenetic information is lost. Because of this, it would be an
advantage to avoid amplification in the library prep altogether 4.
Exceptions to PCR-based library preparations include FRT-seq on the Illumina platforms, in
which the first strand cDNA synthesis reaction is performed on single strands of RNA which are
hybridised to the flowcell surface 5, and the Direct RNA sequencing on the Helicos platform 6, where
a sequencing-by-synthesis reaction was performed using native RNA strands as the sequencing
template. In both of these cases, PCR amplification is avoided, removing this source of bias, and
meaning that the data obtained should reflect the abundance of RNAs in the original sample more
faithfully. However, these approaches still rely on synthetic copies of the original RNA strand, so do
not retain information about modifications. Both of these approaches necessarily generate short
sequence reads, which is far from ideal because in eukaryotes, transcripts commonly undergo
alternative splicing, which creates multiple different isoforms 7. Gene isoforms can have different
transcription start sites, coding sequences and untranslated regions, in some cases producing very
different functions for the different isoforms 8. Short read sequencing typically cannot span entire
transcripts and frequently does not span both sides of splice junctions adequately, and in either case
new splice variants can be missed 9.
By priming cDNA synthesis from the poly-A tail of eukaryotic transcripts and following a
strand-switching protocol, it is possible to create a library consisting almost entirely of full-length
cDNA strands. The power of coupling this preparation method with a long read sequencing platform
has been shown recently 10, 11. In both cases, the authors were able to obtain full gene structures
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
without the need for assembly, and discovered a large number of new transcript isoforms. The
combination of a strand-switching protocol with a long read sequencer allows generation of strandspecific data, which is essential for comprehensive annotation of the transcriptome 12-14 and to
identify antisense transcription, a mechanism for gene regulation and gene silencing that has been
reported to be widespread in the mammalian transcriptome 15, 16.
In none of these methods, even the ‘Direct RNA’ approach 6 is the RNA strand from the
organism being analysed directly; the sequencing reactions are based on the detection of the
products of a synthesis reaction. This means that library preparation is subject to the processivity
and error-rate limitations of reverse transcription. Even if a strand-switching protocol is followed,
which selects only those cDNA strands that correspond to the full length of the original transcript,
poor reverse transcriptase processivity means that longer transcripts, or those which are difficult to
reverse transcribe, are less well represented than shorter ones. This bias is exacerbated during PCR
amplification. In addition, because reverse transcriptases are capable of using either RNA or DNA as
a template, there is the opportunity to create chimeric products and other artefacts 2. Furthermore,
because cDNA-based approaches to RNA-seq detect enzymatically-incorporated nucleotides rather
than the original RNA strand that was in the sample of interest, it is not possible to detect base
modifications directly.
Oxford Nanopore Technologies have developed a nanopore-based sensing platform which is
capable of sequencing DNA directly without the need to perform an enzymatic synthesis reaction. As
with DNA oligonucleotides, strands of RNA can be made to pass through a protein nanopore
embedded in a hydrophobic membrane when an electrical potential is applied 17. If a motor protein
is used to control the speed of translocation of the RNA strand through the nanopore, a signal can
be obtained which is governed by the identity of the RNA strand. The signal changes if the primary
sequence of the RNA is changed 18.
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Here we demonstrate that our sensing platform is capable of sequencing RNA and RNA
modifications directly, without amplification. Oxford Nanopore’s MinION uses an array of single
protein nanopores, each of which is embedded in a synthetic polymer membrane. At either side of
this membrane is a salt-containing buffer, and when an electrical potential is applied, ions pass
through the nanopores. We are able to measure the current flowing through the individual
nanopores, and when a strand of DNA or RNA passes through a pore, the flow of current is partially
blocked, in a sequence-specific way. We limit the speed with which the strands pass through the
pore using a motor protein which is attached to the template strands during library preparation, but
which does not process along the strand until the strand is captured by a nanopore. Using a custom
Hidden Markov Model or Recurrent Neural Network we are able to convert the current readings to
basecalled sequences.
There are several potential advantages to direct RNA sequencing with nanopores: i) the
approach does not require any prior knowledge of the sequences being detected, and is thus
capable of detecting unknown transcripts; ii) it is necessarily strand-specific; iii) it detects the native
nucleotides, not products of a synthesis reaction, meaning that all nucleotides in the strand,
including those that are modified, synthetic, or irregular have the potential to be identified; iv) by
definition, PCR and other forms of amplification are not used, which removes all amplification
biases; v) the system is capable of sequencing strands that are tens of kilobases in length, meaning
that given a fully processive motor protein, the only limitation on read length is the sample quality;
and vi) it is not affected by the processivity and error-rate limitations of reverse transcription.
This is, to our knowledge, the first parallel truly direct RNA sequencing method. The method
is performed using a consumable chip on a commercially available handheld device, and is
compatible with real-time data analysis.
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Methods
i.
Adapter preparation
a. Double-stranded adapter hybridisation (RT splint adapter and sequencing adapter)
A 4uM stock of hybridized ‘RT splint’ adapter was prepared as follows:
•
2 µl of 100 µM top strand adapter
•
2 µl of 100 µM bottom strand adapter
•
1 µl of 50x annealing buffer (0.5 M Tris.HCl pH 7.5, 2.5 M NaCl)
•
45 µl of deionized water
The mixture was heated to 95 °C for 2 minutes and was then cooled to 20 °C at a rate of 0.5 °C sec-1
b. Pre-loaded sequencing adapter
The motor protein was buffer exchanged into a buffer consisting of 100 mM potassium phosphate
pH 8, 125 mM NaCl, 5 mM EDTA, 0.1% TWEEN-20, and then incubated for 10 minutes with 1 nM
hybridised sequencing adapter, to allow binding. The adapter / enzyme complex was purified using a
3.7x excess of AMPure beads before elution into 50 mM Tris pH 8, 20 mM NaCl. We assume a
concentration of 100 nM.
ii.
Library preparation
We prepared direct RNA libraries using the library prep depicted in Fig. 1. Briefly, 500 ng of poly A+
yeast RNA was ligated to 100 nM pre-annealed RT splint adapter using T4 DNA ligase. The products
were purified using a 1.8 µl of of Agencourt AMPure beads per µl of sample, washing twice in 70%
ethanol. A reverse transcription reaction was performed on the products, using the RT splint as a
primer, before a second 1.8x AMPure purification. Sequencing adapters preloaded with motor
protein were then ligated onto the cDNA : mRNA hybrid duplex, the tether oligonucleotide was
added, and the final library was cleaned up using 1.8 µl of of AMPure beads per µl of sample. The
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
library was then mixed with running buffer to 150 µl and injected into an R9 flow cell and run on a
Mk1b MinION.
iii.
Data analysis
a. Basecalling
To decode nucleotide sequences from our Nanopore ionic current measurements we employ Hidden
Markov Models (HMMs). We assume that the ionic current depends solely on five nucleobases
resident in the nanopore, giving us an HMM with 1024 states, or kmers. The stepwise nature of the
ionic current events is thereby explained as observations of the discrete kmers; an event being a
grouping of raw data points with mean value µ. For each event the value of µ is assumed to be
drawn from a normal distribution with mean µ_k and standard deviation σ_k, the index k identifying
a specific kmer. These emission parameters have been trained using standard methods with data
collected from several MinION runs using a range of sequences. Our HMM is constrained by allowing
transitions primarily between the hidden states which correspond to 0, 1, or 2 nucleobase
movements (often referred to as stay, step, and skip). We allow, by a small uniform probability,
transitions from the current state to any other state. With the HMM thus specified, kmer sequences
are decoded with an iterative variation of the posterior-Viterbi decoding scheme 19. During
iterations, scaling parameters and updates to the HMM transition matrix (the stay, step, and skip
probabilities) are made 20. To finally provide a basecall, the kmer predications are joined in a
maximally overlapping manner.
b. Human rhinovirus
Basecalls for a Human Rhinovirus (HRV) dataset were produced using the HMM described above.
Alignments to the reference were made using LAST 21 parameterised –a 50 –b 100 –e 1200 –y 7000,
and using a custom mismatch matrix derived previously. Alignments were filtered to those covering
at least 60% of the read.
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
c. Read mapping and counting
To map reads to the Saccharaomyces cerevisiae Ensembl release 85 cDNA collection we performed
basecalling with the HMM described above and used bwa mem 22 with the parameters –W 15 –k 8 –
x ont2d. Pairwise alignments were created with Clustal W 23. To quantify the synthetic Lambda
phage transcripts we performed basecalling as before and performed alignments using LAST as with
the HRV samples. For each read the highest scoring alignment was used to identify the origin of the
RNA molecule.
d. Detection of methylated adenosine
To show that nanopore sequencing can detect the presence of methylation we trained HMMs from
two distinct samples: a sample which contains only canonical A bases and a sample which exclusively
contains m6A bases. For simplicity of exposition we choose to model all positions of the reference
sequence independently, that is, the HMM comprises a state-space containing as many states as
there are bases in the reference sequence. Such modelling allows us to relax the assumption that
only five bases contribute to the observed ionic current. To bootstrap the training we do however
use the emission parameterisation of our 5mer basecalling model: we initialise the parameters µ_p
and σ _p to those which can be found in our 5mer model indexed on the 5 bases surrounding the
reference position p. Having trained the HMM models, the final emission parameter sets represent a
consensus “squiggle” across all reads in the two datasets. To compare the consensuses we perform
least squares regression of µ_p and µ_p(meth) at reference positions which are expected to not be
affected by the methylation under the 5mer assumption.
Results
i)
Direct RNA sequencing of yeast transcripts
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
When an RNA strand passes through a nanopore, the current level changes with the sequence in a
similar way to when the analyte is DNA. We prepared a 1D RNA library from yeast poly A+ RNA,
following the library preparation protocol outlined in
Fig. 1
and sequenced the library using R9
flowcells, at a translocation speed of ~90 nucleotides per second per pore on a MinION. We
measured current levels using and created fast5 files for each read.
Fig. 2a
shows the current levels
resulting from the translocation of a single RNA molecule through one of the pores in the array.
When no strand is passing through a given pore, then the maximum ionic current is able to flow,
causing an ‘open pore current’. As a strand begins to pass through the pore in a 3’-5’ direction, the
adapter oligo is detected first, followed by the poly A tail of the transcript, then the body of the
transcript, and the current returns to the open pore level as the transcript exits the pore on the
opposite side of the membrane.
Fast5 files were basecalled and aligned to the Saccharomyces cerevisiae S228C
transcriptome.
Table 1
shows an alignment of a section of a typical read against the closest match in
the transcriptome, YDL229W. We estimate the identity of the complete read to be ~80%.
YDL229W
141 GTTGCTTTCA CTCCAGAAGA AAGATTGATT GGTGATGCTG CCAAGAA--C CAAGCTGCTT TGAAC-CCAA
||||||| || |||||||||| | |||||||| | ||||||
| | | | | ||||||| ||||| || |
141 GTTGCTTCCA CTCCAGAAGA A-GATTGATT AGCGATGCTA ACCGGTACTC CGAGCTGCTC TGAACTCCCA
RNA
. Alignment of Direct RNA sequence data to the closest match from the S228C transcriptome
Table 1
Fig. 2b
and
Table 2
show a selection of abundant poly A+ RNAs identified from the yeast
transcriptome, along with their identities. As might be expected for a growing yeast culture, these
transcripts are mostly ribosomal or involved in metabolism. The coverage of several of these
transcripts appears to decrease towards the 5’ end, raising the possibilities that further engineering
of the current RNA motor protein may improve its processivity, and that more careful handling of
RNA strands may help to reduce degradation.
Transcript
number
Systematic
name
Chromosome
Start
position
End
position
Transcript
length (nt)
Read
count
Description
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
1
2
3
YBR031W
YCR013C
YGL123W
II
III
VII
300166
138402
277617
301254
139049
278381
1,088
647
764
527
108
574
4
YGR192C
VII
882812
883810
998
3,217
5
6
YHR174W
YJR009C
VIII
X
451327
453683
452640
454681
1,313
998
2,029
743
7
YJR123W
X
651901
652578
677
754
8
9
10
11
12
YLR044C
YLR075W
YOL086C
YOL039W
YPR080W
XII
XII
XV
XV
XVI
232390
282927
159548
254297
700594
234081
283592
160594
254617
701970
1,691
665
1,046
320
1,376
1,514
720
1,592
338
1,061
Ribosomal 60S subunit protein L4A
Unknown
Protein component of the small (40S) ribosomal
subunit
Glyceraldehyde-3-phosphate dehydrogenase
(GAPDH), isozyme 3
Enolase II, a phosphopyruvate hydratase
Glyceraldehyde-3-phosphate dehydrogenase
(GAPDH), isozyme 2
Protein component of the small (40S) ribosomal
subunit
Major pyruvate decarboxylase isozyme
Ribosomal 60S subunit protein L10
Alcohol dehydrogenase
Ribosomal protein P2 alpha
Translational elongation factor EF-1 alpha
. Identities and locations of the most abundant transcripts identified
Table 2
Two of the transcripts identified mapped to isozymes of GAPDH, forms of the same enzyme that are
encoded by similar genes at different loci. The sequences of the two genes are 95.8% identical,
differing at multiple variants dispersed throughout the sequences (data not shown). Even though the
single-read accuracy of the direct RNA data is currently below this, further analysis of the reads
mapping to each isozyme imply correct placement: firstly, multimapping (and hence randomly
placed) reads are in the minority, and when these are removed, we are still left with 3217 reads
mapping to YGR192C and 743 mapping to YJR009C (
Table 2
). Secondly, there are 42 positions which
differ between the two isozyme genes, and in the reads mapping to YGR192C, the most frequent
nucleotide at each position was the correct base. The same was true for 42 SNPs positions in the
reads mapping to YRJ009C. Thirdly, when the reads mapping to one isozyme reference sequence
were analysed and variants matching that same reference were counted and plotted, alongside the
variants mapping to the other isozyme reference sequence, a clear difference can be seen (
and
Figs. 2c
).
2d
ii)
Direct RNA sequencing of human rhinovirus
We prepared a 1D RNA template from the ~7.5 kb human rhinovirus (HRV) single-stranded RNA
genome using the library preparation protocol outlined above and sequenced the library as before.
Fig. 3a
shows the current trace (‘squiggle’) from a full-length strand of HRV. The data was basecalled
using the bespoke 1D Hidden Markov Model-based basecaller described previously, and reads were
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
aligned to the norovirus, HRV and influenza A reference sequences using LAST.
Fig. 3b
shows the
alignment, beneath a Savant representation of the NCBI Reference sequence. No off-target mapping
was observed.
iii)
Direct measurement of RNA reduces bias
Having no amplification as part of the library prep means that direct RNA sequencing is free from the
bias and size limitations of this step. Additionally, although reverse transcription was used in the
library prep shown here, the cDNA strand served to improve performance of the system, and was
not sequenced itself, so the biases of this step might be expected to have a greatly reduced
influence on direct RNA sequencing compared to cDNA sequencing. To investigate this, we pooled 3
transcripts, generated synthetically from Lambda phage amplicons, in 10:30:60 ratios and counted
reads of each.
Fig. 4
shows that the read counts were close to the expected values, indicating low
bias.
iv)
Direct RNA sequencing detects modified bases
Direct RNA sequencing measures the current blockade caused by RNA bases in the narrowest part of
the nanopore. Base modifications that affect this current can also be analysed. To determine the
effect of a common RNA modification (m6A) on the current trace we sequenced fully modified and
unmodified strands of the FLuc transcript and aligned the current traces.
Fig. 5
shows a region of this
alignment. The levels corresponding to unmodified and fully modified strands are shown in red and
blue respectively, beneath the base sequence of central nucleotide in the 5-mer used for alignment.
m6A-containing kmers can be discriminated from rAMP-containing kmers by the nanopore.
Discussion
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
In this manuscript we present the first method for sequencing RNA in which the RNA strand is read
directly, rather than by detecting the products of a synthesis reaction. This approach has many
potential advantages over other RNA-seq strategies, namely that i) it is amplification-free, so should
be less biased than methods which rely on either PCR or on clonal amplification, ii) by measuring the
RNA directly, we avoid masking the presence of nucleotide analogues iii) our approach is compatible
with very long reads, making it useful for the study of splice variants, and iv) the method is strandspecific, allowing the unambiguous identification of sense and antisense transcripts.
There is considerable room for improvement in all aspects of the method, particularly those
governing throughput and accuracy. The Oxford Nanopore sequencing technology is modular,
allowing straightforward modification and optimisation of all parts of the library preparation,
sequencing, and data interpretation and analysis processes.
For the work presented here, we operated the motor protein at a speed of around 80
nucleotides per second. Refinements to our DNA sequencing process have allowed us to increase
the DNA motor speed approximately 10-fold, with scope to increase this further. Further screening
of motor protein mutants will allow us to find enzymes with the best combination of steady
movement, good processivity and high processing speed increase, which will increase throughput
and data quality.
The method of library preparation allows the RNA strand to pass 3’ end first through the
nanopore. For DNA templates, we ordinarily sequence 5’ end first, resulting in smooth strand
movement, and hence high quality data. For direct RNA sequencing, we will continue to investigate
sequencing in both 3’-5’ and 5’-3’ directions.
It is not obligatory to synthesise a cDNA strand opposite the RNA template prior to
sequencing, although we have found that the duplex improves throughput, possibly by reducing
intramolecular secondary structure of the RNA. The library preparation method used here involved
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
cDNA synthesis from the poly A tail of the RNA strand. This may not be ideal, because in addition to
depending on high processivity of the reverse transcriptase to synthesise full-length cDNAs, it is
necessary to add a poly A tail to the 3’ ends of strands which do not ordinarily have one. Future work
will involve investigating other approaches to cDNA synthesis, such as random priming. However, it
should be noted that because we do not sequence the cDNA strand, incomplete cDNA synthesis
does not prevent sequencing of an RNA strand, and the error rate of the reverse transcriptase is not
an issue.
It is possible to link the cDNA and RNA strands together with a hairpin adapter, and to
sequence one strand after the other, combining the two reads into a single sequence, in an
analogous way to the 2D reads generated by some Oxford Nanopore’s genomic DNA protocols.
Using a motor protein that has sufficiently good movement on both RNA and DNA polynucleotides it
might therefore be expected that sequencing both RNA and cDNA strands would give improved data
accuracy.
For direct RNA sequencing we are currently using the same E. coli CsgG-derived nanopore
that we use for DNA sequencing 24. This has the advantage of allowing both DNA and RNA strands to
be sequenced together on the same flow cell, but there may be superior nanopores for RNA
sequencing. Improvements in the quality of the nanopore signal from RNA are likely to result from
pore mutations and we will continue to both engineer our current pore and assess different
nanopores in order to find the optimal pore for RNA sequencing. Alongside this work, we will
evaluate alternative buffer compositions and running conditions, to find the best combination for
high throughput and accuracy.
The basecaller used in this work is a naive HMM based solely on event means and has not
been extensively optimised for the purpose of basecalling RNA data. The MinION data is far richer
than this simple model supports. Logical extensions under the HMM framework would be explicit
modeling of event noise or event-length distributions. It would also be beneficial to incorporate
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
dwell time information and common signal irregularities into the model, such as periods of
systematic shifts in current spanning multiple events. Improved HMMs facilitate the training of more
complex models requiring pre-labeled data sets. Recurrent Neural Networks (RNNs) have been
shown to be effective at basecalling MinION experiments with DNA samples 25 and could equally be
applied to RNA experiments. These models are not constrained by the Markov assumption implicit in
the models used here, and can better exploit, for example, long-range correlations in the signal.
In the current work we have shown that MinION data presents a clear systematic difference
between groups of molecules differing only in the presence or absence of modified nucleotides.
Such analysis indicates that it is possible to detect base modifications at a single molecule level, with
single-nucleotide resolution. It is trivial to extend the state space of an HMM to include states for
modified bases, or to allow an RNN to provide classification scores for a larger kmer set, which will
enable us to construct a basecaller to achieve this goal. Although the computational cost of HMMs
will grow exponentially with the number of bases included in the model, many cases of practical
relevance will wish to target limited choices of base analogues at specific loci in a reference
sequence. In such cases more direct methods can be applied which need not suffer from scaling
issues.
At the current level of accuracy, it will be possible to answer important biological questions,
such as those involving splice variation. However, we are confident that with optimisation and
further engineering of the components of the sequencing system described here we will reach a 1D
accuracy and throughput for RNA sequencing that is similar to that of 1D DNA reads. We expect that
the features of nanopore-based direct RNA sequencing, a combination of being amplification-free,
capable of detecting nucleotide analogues, and able to generate full-length, strand-specific reads,
will enable us to gain a far deeper understanding of transcriptomes than has been possible with
indirect, short read sequence data.
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
References
1.
2.
Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics.
(1), 57-63 (2009)
Wu, J.Q. et al. Systematic analysis of transcribed loci in ENCODE regions using RACE
sequencing reveals extensive transcription in the human genome. Genome Biology (1), R3
(2008)
Kozarewa, I. et al. Amplification-free Illumina sequencing-library preparation facilitates
improved mapping and assembly of (G+C)-biased genomes. Nature Methods (4), 291-295
(2009)
Lipson, D. et al. Quantification of the yeast transcriptome by single-molecule sequencing.
Nature Biotechnology (7), 652-658 (2009)
Mamanova, L. et al. FRT-seq: amplification-free, strand-specific transcriptome sequencing.
Nature Methods (2), 130-132 (2010)
Ozsolak, F. et al. Direct RNA sequencing. Nature
(7265), 814-818 (2009)
Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Deep surveying of alternative splicing
complexity in the human transcriptome by high-throughput sequencing. Nature Genetics
(12), 1413-1415 (2008)
Boise, L.H. et al. bcl-x, a bcl-2-related gene that functions as a dominant regulator of
apoptotic cell death. Cell (4), 597-608 (1993)
Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nature
Methods (12), 1177-1184 (2013)
Thomas, S., Underwood, J.G., Tseng, E., Holloway, A.K. & Bench To Basinet Cv, D.C.I.S. Longread sequencing of chicken transcripts and identification of new transcript isoforms. PLoS
One (4), e94650 (2014)
Bolisetty, M.T., Rajadinakaran, G. & Graveley, B.R. Determining exon connectivity in complex
mRNAs by nanopore sequencing. Genome Biol 204 (2015)
David, L. et al. A high-resolution map of transcription in the yeast genome. Proceedings of
the National Academy of Sciences of the USA
(14), 5320-5325 (2006)
Dutrow, N. et al. Dynamic transcriptome of Schizosaccharomyces pombe shown by RNADNA hybrid mapping. Nature Genetics (8), 977-986 (2008)
Wilhelm, B.T. et al. Dynamic repertoire of a eukaryotic transcriptome surveyed at singlenucleotide resolution. Nature
(7199), 1239-1243 (2008)
Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science
(5740), 1559-1563 (2005)
Katayama, S. et al. Antisense transcription in the mammalian transcriptome. Science
(5740), 1564-1566 (2005)
Clamer, M., Hofler, L., Mikhailova, E., Viero, G. & Bayley, H. Detection of 3'-end RNA
uridylation with a protein nanopore. ACS Nano (2), 1364-1374 (2014)
Smith, A.M., Abu-Shumays, R., Akeson, M. & Bernick, D.L. Capture, Unfolding, and Detection
of Individual tRNA Molecules Using a Nanopore Device. Frontiers in Bioengineering and
Biotechnology 91 (2015)
Fariselli, P., Martelli, P. & Casadio, R. A new decoding algorithm for hidden Markov models
improves the prediction of the topology of all-beta membrane proteins. BMC Bioinformatics
(4), S12 (2005)
David, M., Dursi, L., Yao, D., Boutros, P. & Simpson, J. Nanocall: An Open Source Basecaller
(2016)
for Oxford Nanopore Sequencing Data. bioRxiv
Kiełbasa, S., Wan, R., Sato, K., Horton, P. & Frith, M. Adaptive seeds tame genomic sequence
comparison. Genome Research 487-493 (2011)
Li, H. & Durbin, R. Burrows-Wheeler Alignment Tool. http://biobwa.sourceforge.net/bwa.shtml (2012)
Nature Reviews Genetics
10
9
3.
6
4.
5.
6.
7.
27
7
461
40
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
74
10
9
16
103
40
453
309
309
8
3
6
http://dx.doi.org/10.1101/046086
21
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
23.
24.
25.
Larkin, M.A. et al. Clustal W and Clustal X version 2.0. Bioinformatics 23 (21), 2947-2948
(2007)
Deamer, D., Akeson, M. & Branton, D. Three decades of nanopore sequencing. Nat
Biotechnol 34 (5), 518-524 (2016)
Boža, V., Brejová, B. & Vinař, T. DeepNano: Deep Recurrent Neural Networks for Base Calling
in MinION Nanopore Reads. arXiv:1603.09195v1 (2016)
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
AAAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTT
full-length
mRNA
ligation of RT splint
AAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTT
reverse transcription
AAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTTTT
ligation of
sequencing adapters
AAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTTTT
sequence RNA strand
Fig. 1. One of several possible library preparation methods for Direct RNA sequencing.
a)
open pore current
open pore current
200
Current (pA)
leader oligo
~1920 nt transcript
poly
A
tail
bioRxiv preprint doi: https://doi.org/10.1101/068809
.
this
version
posted
August 12, 2016. The copyright holder for this preprint (which was not
150
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
100
50
0
b)
Read count
1352.5
1362.5 1367.5
Time (s)
3
5
6 7 8
9
Transcript number
1372.5
1377.5
3,000
2,000
1,000
0
c)
1357.5
1
2
4
10 11 12
700
Read count
600
500
SNVs matching YGR192C
SNVs matching YJR009C
400
300
200
100
0
d)
5
20
25
30
10
15
Number of matching variants
35
40
200
Read count
160
SNVs matching YJR009C
SNVs matching YGR192C
120
80
40
0
5
10
15
20
25
30
35
40
Number of matching variants
Fig. 2. Direct RNA sequencing of yeast poly A+ RNA a) Squiggle resulting from translocation of a single
transcript through a pore in the array. The trace is annotated with significant features b) Nucleotide coverage for
transcripts in the sample identified over a range of read counts c) Variants matching each isozyme reference
sequence for all reads mapping to YGR192C d) Variants matching each isozyme reference sequence for all reads
mapping to YJR009C.
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
a)
Current (pA)
120
80
40
0
0
1,000
2,000
3,000
4,000
5,000
6,000
7,000
Position in HRV genome
b)
norovirus
rhinovirus
influenza A
Z:\Noro_HRV_FluA.fasta
Read depth
Z:\HRV_RNA.sorted.bam
80
40
0
0
2,000
4,000
6,000
Position in HRV genome
Fig. 3. Full-length direct RNA reads of human rhinovirus a) squiggle of full-length read b) aligned full-length
reads
Relative abundance (%)
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
60
Expected values
RNA 1
RNA 2
RNA 3
40
20
0
Mix A
Mix B
Mix C
Fig. 4. Quantitative measurement of Direct RNA reads.
bioRxiv preprint doi: https://doi.org/10.1101/068809. this version posted August 12, 2016. The copyright holder for this preprint (which was not
certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
100
CUGUCGAAUUAAUUCGCCCGGCGAAUGUGCC
Unmodifed
Current (pA)
90
m6A
80
70
60
210
220
230
Position in reference
Fig. 5. Detecting the m6A modification by direct RNA sequencing.