Computational Methods For 16S Metabarcoding Studies Using Nanopore Sequencing Data

Computational and Structural Biotechnology Journal xxx (xxxx) xxx
30 January 2020
journal homepage: www.elsevier.com/locate/csbj

Review

Andres Santos (a,b,c), Ronny van Aerle (c), Leticia Barrientos (a,b,c), Jaime Martinez-Urtaza (c,⇑)

(a) Applied and Molecular Biology Laboratory, Centre of Excellence in Translational Medicine, Universidad de La Frontera, Avenida Alemania 0458, 4810296 Temuco, Chile
(b) Scientific and Technological Bioresource Nucleus, Universidad de La Frontera, Avenida Francisco Salazar 01145, 481123 Temuco, Chile
(c) Centre for Environment, Fisheries and Aquaculture Science (Cefas), Barrack Road, Weymouth, Dorset DT4 8UB, UK

Article info

Article history: Received 5 September 2019; Received in revised form 15 January 2020; Accepted 15 January 2020; Available online xxxx

Keywords: Third generation sequencing; MinION; Microbial diversity

Abstract

Assessment of bacterial diversity through sequencing of 16S ribosomal RNA (16S rRNA) genes has been an approach widely used in environmental microbiology, particularly since the advent of high-throughput sequencing technologies. An additional challenge introduced by these technologies was the need to develop new strategies to manage and investigate the massive amount of sequencing data generated. This situation stimulated the rapid expansion of the field of bioinformatics with the release of new tools for the downstream analysis and interpretation of sequencing data, mainly generated using Illumina technology. In recent years, a third generation of sequencing technologies has been developed and applied in parallel and complementarily to the former sequencing strategies. In particular, Oxford Nanopore Technologies (ONT) introduced nanopore sequencing, which has become very popular among molecular ecologists. Nanopore technology offers a low price, portability and fast sequencing throughput. This powerful technology has recently been tested for 16S rRNA analyses, showing promising results. However, compared with previous technologies, there is a scarcity of bioinformatic tools and protocols designed specifically for the analysis of Nanopore 16S sequences. Due to its notable characteristics, researchers have recently started assessing the suitability of MinION for 16S rRNA sequencing studies, and have obtained remarkable results. Here we present a review of the state of the art of MinION technology applied to microbiome studies, its current possible applications and the main challenges for its use in 16S rRNA metabarcoding.

© 2020 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Contents

1. Introduction
1.1. Current analytical approaches applied in 16S metagenomic studies
2. Third generation of sequencing technologies
3. The potential of the Nanopore sequencing for 16S rRNA studies
3.1. Nanopore 16S metagenomic studies
3.2. Taxonomic assignment using Nanopore 16S sequences
3.3. Constraints to move beyond taxonomic assignment with Nanopore sequencing data
4. Summary and outlook
Uncited references
CRediT authorship contribution statement
Acknowledgements
References

⇑ Corresponding author at: The Centre for Environment, Fisheries and Aquaculture Science (CEFAS), The Nothe, Barrack Road, Weymouth, Dorset DT4 8UB, UK. E-mail address:
[email protected] (J. Martinez-Urtaza).
https://doi.org/10.1016/j.csbj.2020.01.005
2001-0370/ 2020 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Please cite this article as: A. Santos, R. van Aerle, L. Barrientos et al., Computational methods for 16S metabarcoding studies using Nanopore sequencing data, Computational and
Structural Biotechnology Journal, https://doi.org/10.1016/j.csbj.2020.01.005

1. Introduction

Functionality, interaction, and dynamics of microbial communities are considered critical to the existence of ecological balance and life [1,2]. The fact that less than 1% of microorganisms are cultivable under laboratory conditions [3] has historically constrained efforts to provide a precise dimension of the microbial world and to study microbial diversity within a taxonomic context.

Since the foundations of molecular phylogeny were established in the 1960s and 70s, the 16S rRNA gene has been universally used for taxonomic studies of prokaryotic species [4,5]. 16S rRNA is part of the small ribosomal subunit (SSU) present in all prokaryotic cells, and the gene encoding this molecule possesses some distinctive characteristics that make it suitable for taxonomic profiling: 1) it is ubiquitous, being found in all prokaryotic and archaebacteria organisms [6]; 2) it has a relatively small size (~1500 bp) and a high degree of functional conservation [5]; 3) it contains variable regions, resulting from diverse rates of evolution among species, which can be used to distinguish different bacterial groups [7,8]; and 4) it contains highly conserved regions, which can be used to design universal primers flanking the different hypervariable regions (nine in total, V1-V9) identified in the gene [9]. On the other hand, the use of 16S rRNA for bacterial identification has some limitations, including the variable number of copies of these genes in bacterial genomes, the low taxonomic resolution at the species level for some bacterial groups, and the bias in taxonomic assignment of sequences depending on the variable region chosen for the analysis [10].

Until the late 1990s, the 16S rRNA gene was only applied in a taxonomic context to define species uniquely based on individual bacteria obtained from pure (mostly clinical) cultures [6,11]. However, in 1997, Pace et al. [12] described for the first time the composition of microbial communities without the need for cultivation in the laboratory, by sequencing the 16S rRNA gene with Sanger sequencing. This work led to the establishment of a universal approach to the study of microbial communities. Nowadays, sequence analysis of 16S rRNA continues to be the gold standard for studying microbial diversity, enabling accurate taxonomic profiling of the prokaryotic groups present in both clinical and environmental samples [11,12].

The introduction of Sanger sequencing technology in the investigation of microbial communities signified a revolution in the world of microbial ecology and entirely changed how microbial diversity was assessed. However, this approach required the analysis of individual sequences, implying that a cloning step was needed as a crucial prerequisite for the investigation of samples (Fig. 1a). As a result, sequences of up to 1000 bp can be generated.

Fig. 1. Most common metabarcoding sequencing strategies for each sequencing technology generation. (a) First generation sequencing (Sanger). Under this approach, metabarcoding is classically performed by amplifying full-length 16S rRNA genes from an environmental DNA sample; once the amplicons have been obtained, they are cloned: the sequences are inserted into a vector, which is then transformed into a host; finally, plasmid extraction and purification are performed and the 16S rRNA inserts are sequenced by the Sanger method. (b) Second generation sequencing (Illumina). From environmental DNA samples, specific regions of the 16S rRNA gene are amplified by PCR; depending on the scope of the study, one or two regions of the 16S gene can be amplified, with regions V1-V2 and V3-V4 being the most frequently used; a paired-end library (the mix of DNA fragments with adapters attached to their ends and ready to be sequenced) is usually prepared from these regions; adapters (exogenous nucleic acids that are ligated to a nucleic acid molecule to be sequenced) and indexes (unique DNA sequences ligated to fragments within a sequencing library, which allow the subsequent sorting and identification of different samples sequenced in the same run) are added to the ends of the 16S amplicons, and libraries of 300 bp in length are finally sequenced on the Illumina MiSeq platform. (c) Third generation sequencing (Nanopore). This recently developed approach starts with the amplification of the full-length 16S rRNA gene from environmental DNA using universal primers; indexes for multiplexing are added to the amplicons in the same PCR reaction; once the amplicons have been purified, the library is prepared by adding a protein at a specific tagged region of the 16S amplicons (10 min for library preparation); finally, the samples are sequenced directly on the MinION sequencer.
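
The role of the indexes described in the caption can be illustrated with a minimal demultiplexing sketch. It is not taken from any of the pipelines reviewed here: the index sequences, sample names and the `demultiplex` function are hypothetical, and the sketch assumes that each read starts with an exact, fixed-length index, whereas real demultiplexers tolerate sequencing errors in the index.

```python
from collections import defaultdict

# Hypothetical index sequences, invented for illustration only.
INDEXES = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}


def demultiplex(reads, indexes=INDEXES):
    """Sort reads into per-sample bins using the index at the start of each read.

    Assumes all indexes have the same length and must match exactly.
    Matched reads are stored with the index removed; others go to 'unassigned'.
    """
    index_len = len(next(iter(indexes)))
    bins = defaultdict(list)
    for read in reads:
        sample = indexes.get(read[:index_len], "unassigned")
        insert = read[index_len:] if sample != "unassigned" else read
        bins[sample].append(insert)
    return dict(bins)


if __name__ == "__main__":
    example_reads = [
        "ACGTACGGCTACCTTGTT",  # carries the sample_A index
        "TGCATGAGAGTTTGATCC",  # carries the sample_B index
        "NNNNNNAGAGTTTGATCC",  # no recognizable index
    ]
    for sample, seqs in demultiplex(example_reads).items():
        print(sample, seqs)
```

Exact matching keeps the example short; with error-prone long reads, an alignment-based or mismatch-tolerant comparison of the index would normally be preferred.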
Table 1
Comparison of the available sequencing platforms for 16S metagenomic analysis using the metabarcoding approach.

Sequencing platform | Read length (bp) | Accuracy | Output | Sequencing chemistry | Run time | Advantages in metabarcoding approaches
Sanger | 400–900 | 99.999% | 1.9–84 Kb | Dideoxy chain termination | 20 min–3 h | Long read length, high quality
Illumina MiSeq | 75–300 | 99.9% | 13.2–20 Gb | Sequencing by synthesis | 21–56 h | High throughput, read quality
MinION | >200,000 | 95% | 50 Gb | Single sequencing real time-long reads | 1–48 h | High throughput, long read length, portability
PacBio | 10–15 Kb | 99.999% | 5–10 Gb | Single sequencing real time-long reads | 4 h | Long read length and quality
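
To put the accuracy figures in perspective, the following sketch converts a per-base accuracy into a Phred quality score and into the expected number of erroneous bases per read. It uses the accuracies quoted in this review for Illumina (99.9%) and MinION (95%); the read lengths (300 bp and a full-length 16S amplicon of about 1500 bp) and the assumption that errors are independent and uniformly distributed along the read are simplifications made for illustration.

```python
import math


def phred_from_accuracy(accuracy: float) -> float:
    """Phred quality score, Q = -10 * log10(per-base error probability)."""
    return -10.0 * math.log10(1.0 - accuracy)


def expected_errors(read_length_bp: int, accuracy: float) -> float:
    """Expected number of erroneous bases in a read, assuming independent errors."""
    return read_length_bp * (1.0 - accuracy)


if __name__ == "__main__":
    platforms = {
        "Illumina MiSeq (300 bp read)": (300, 0.999),
        "MinION (full-length 16S, ~1500 bp)": (1500, 0.95),
    }
    for name, (length, accuracy) in platforms.items():
        q = phred_from_accuracy(accuracy)
        errors = expected_errors(length, accuracy)
        print(f"{name}: Q{q:.0f}, about {errors:.1f} expected errors per read")
```

Under these assumptions a 300 bp Illumina read carries well under one expected error, whereas a full-length Nanopore 16S read carries several tens, which is one reason why tools designed for short, accurate reads do not transfer directly to Nanopore data.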

However, the number of sequences that could be analyzed was limited due to the output of Sanger platforms (Table 1). Therefore, a complete evaluation of bacterial diversity using Sanger sequencing became a serious challenge in terms of time and costs.

Globally, the advent of high-throughput sequencing, or Second Generation Sequencing (SGS) technologies, and its rapid and widespread application across laboratories in the early 2000s represented a paradigm shift in microbial ecology. The characteristic high output and data accuracy provided by these new technologies, along with the removal of tedious and time-consuming steps such as the cloning of DNA fragments and electrophoretic separation of sequencing products required for Sanger sequencing, make possible the generation of massive sequencing data in short runs. Among the different companies pioneering high-throughput sequencing, Illumina has achieved a leading position in the market, becoming the standard sequencing technology and the most frequently applied in microbial ecology studies [15,16]. The common elements in the sequences generated by this technology are the reduced length (from 50 bp to 300 bp), high throughput (from 2 Gb to 750 Gb), high accuracy, and reduced cost (starting from approximately $40 USD per Gb) [17] (Table 1).

Nevertheless, due to the differences between Illumina and Sanger technologies in terms of sequence length, full-length sequences of the 16S rRNA gene are not achievable using Illumina sequencing alone. To overcome this limitation, 16S gene analysis with Illumina has typically been restricted to specific variable regions of the 16S rRNA gene, instead of the complete gene (Fig. 1b). However, the remarkable characteristics of Illumina sequencing in terms of output, accuracy and speed have made this technology central in almost all of the most prominent studies based on 16S analysis carried out to date, including the Human Microbiome Project [18], Earth Microbiome Project [19] and the Extreme Microbiome Project [20].

1.1. Current analytical approaches applied in 16S metagenomic studies

An additional challenge introduced by high-throughput sequencing technologies was the need for new strategies to manage and investigate the massive amount of sequencing data generated. From the user perspective, this change involved a transition from the application of basic computer programs accessible to general users on standard computers, to the need for sophisticated computational analysis requiring advanced bioinformatic skills. This situation stimulated the rapid expansion of the field of bioinformatics applied to microbial ecology studies, mainly with the release of new tools for the downstream analysis and interpretation of sequencing data. Nowadays, a large number of powerful tools are available which enable an efficient integration of different types of data [17–19].

Within this context, several bioinformatics programs and tools for processing amplicon sequencing data are presently available, most of them designed to work with the V3 and V4 variable regions of the 16S rRNA gene. The most popular packages for 16S amplicon analysis are QIIME [24], MOTHUR [25] and Phyloseq [26]. In particular for 16S metagenomic studies, standard analysis packages and pipelines typically include a workflow comprising demultiplexing and quality control steps, followed by the generation of Operational Taxonomic Units (OTU picking) and/or Amplicon Sequence Variant (ASV) analysis, which allows the taxonomic assignment of representative sequences and diversity analysis of the sample (Fig. 2). Consequently, taxonomic assignment of sequences is a critical step and the most informative element for microbial diversity analyses.

A detailed pipeline of the most conventional workflows for 16S rRNA Illumina sequences is presented in Fig. 2. Despite the differences between the packages, the principal components of the workflow are analogous and share a common process, which includes: quality control of sequences, clustering or ASV analyses, taxonomic assignment and diversity analyses (Fig. 3).

Fig. 2. Classic pipelines MOTHUR [25] and QIIME2 [24] and their complete workflow for 16S rRNA amplicon analyses; the “common processes” flow contains all steps common to both pipelines.

Fig. 3. Recommended MinION 16S rRNA amplicons pipeline for bacterial diversity analysis.

2. Third generation of sequencing technologies

In recent years, a third generation of sequencing (TSG) technologies has been developed and used in parallel and complementarily to the former sequencing strategies. These new technologies interrogate a single molecule of DNA in real time and produce very long reads (from 1 to 100 kb). In 2011, Pacific Biosciences introduced the first TSG technology, which was termed single-molecule real-time sequencing [23,24]. Recent releases of new sequencers, in particular the Sequel, have improved the output by increasing read length and throughput per run by 10- and 100-fold, respectively. However, although this new platform is two-fold cheaper than the previous versions, it is still less cost-effective than Illumina, and therefore the applications of this platform to 16S metagenomic studies remain scarce. In addition, the error rate falls in the same range as the first PacBio version (~13%) [29] and the output is still lower than that of Illumina. Therefore, price and limited output have restricted the application of the PacBio system in microbial community studies [26–28] (Table 1).

In 2014, Oxford Nanopore Technologies (ONT) introduced nanopore sequencing [33]. Nanopore sequencing was developed at the end of the 1980s [34], although the first successful use of this sequencing technology was reported in 2012 [35]. This sequencing technology directly detects the nucleotides without active DNA synthesis, since a long stretch of single-stranded DNA passes through a protein nanopore that is stabilized in an electrically resistant polymer membrane [29–31]. Specifically, nucleotide detection is based on setting a voltage across this membrane, which is equipped with sensors able to detect, in real time, the changes in ionic current caused by the nucleotides occupying the pore while the DNA molecule passes through.

Applying this technology, ONT released the MinION platform in 2014, with some remarkable advantages such as low price, portability, and fast sequencing chemistry [39]. MinION is essentially a dock that holds a flowcell responsible for the direct sequencing of individual DNA strands that translocate through the nanoscale pores in the semiconductor membrane [40]. The most remarkable characteristic of the MinION Nanopore sequencer is the length of the sequences generated by the flowcell and the amount of data that can be produced per run. Moreover, MinION is a miniaturized sequencing device and the smallest available on the market today, with dimensions of 10 × 3 × 2 cm and a weight of 87 g. One particular feature is that the sequencing process does not utilize a secondary signal such as light or pH, as with Illumina and PacBio [41]. According to the manufacturer, the most recent chemistry used in the R9.4.5 version of the flowcell provides an accuracy of 95% with an output of 20 Gb. However, the quality of the reads generated by the R9.4.5 flowcell is still lower than that of Illumina reads, which have an accuracy of 99.9% (Table 1). A typical problem in Nanopore reads is the frequent presence of artificial insertions and deletions in the sequences, which may hinder the correct analysis and interpretation of data from MinION [38].

Another remarkable characteristic of ONT platforms is that data analysis can be performed from the beginning of the sequencing run, which could considerably reduce the time of analysis compared to Illumina platforms. In addition, costs associated with the analyses performed by MinION are much lower compared with other sequencing platforms currently applied for 16S metagenomic studies (Table 1). All these characteristics make the MinION an accessible technology for many laboratories, which has led to a rapid expansion of the use of this technology across the scientific community. Within this context, a remarkable and original feature that ONT has developed is the “nanopore community”, which is part of the ONT website. This “community” provides a common space where users can get help and feedback on device performance, methodologies, and bioinformatic analysis. It is important to note that there are other ONT platforms with the same characteristics that can produce larger quantities of sequencing data than the MinION platform, such as GridION (100 Gb) and PromethION (6 Tb) [35].

3. The potential of the Nanopore sequencing for 16S rRNA studies

Nanopore sequencing brings to 16S rRNA metabarcoding studies the benefits of both first and second-generation sequencing. ONT platforms generate long reads, allowing them to cover the full-length sequence of the 16S rRNA gene (V1-V9 regions) through a fast, cheap, and high-throughput process. One of the most relevant advantages of full-length 16S rRNA sequences is that they offer a higher level of taxonomic and phylogenetic resolution for bacterial identification, since all the informative sites of the 16S rRNA gene are considered in the analysis [42]. With Illumina sequencing, the conventional strategy for sequencing the 16S rRNA gene uses the hypervariable regions V1-V2 and/or V3-V4 [43], and taxonomy is assigned based only on these short variable regions of approximately 300 bp. The analysis of these short regions provides a limited taxonomic resolution in most cases, failing to reliably discriminate sequences beyond genus level [37,38]. Moreover, the choice of these regions has a direct effect on the specificity of the taxonomic assignment. For example, V4 regions better represent the whole bacterial diversity in host-associated studies, while V1-V2 are more specific for skin microbiota studies. In addition, taxonomic resolution varies for different groups of bacteria when using different portions of the 16S rRNA gene [46]. By contrast, the resolution obtained with Nanopore sequencing is only comparable to levels provided by Sanger 16S rRNA sequencing, with the potential for providing better discrimination among taxa, a deeper phylogenetic signal, and a more accurate taxonomic placement of 16S rRNA nanopore sequences [40,37,35]. Another advantage of ONT is that data can be generated in a short runtime (1–48 h) and at an affordable price (~$50 USD per sample) (Table 1).

As previously mentioned, MinION is one of the most popular ONT platforms today and has been used extensively in genomics and transcriptomics studies [41–46], and over the last two years its use in studies on microbial diversity has grown rapidly. However, despite the evident benefits of the use of ONT technology in microbial ecology studies, there are still several factors limiting the implementation of these new approaches in the routine analysis of microbial diversity. The scarcity of tools specifically designed to work with full sequences of the 16S gene has made it extremely challenging to carry out a specialized taxonomic analysis of Nanopore sequences. Moreover, the limited quality of Nanopore 16S sequences has represented a serious constraint on applying existing tools designed for other technologies (mostly Illumina) to analyze these sequences.

3.1. Nanopore 16S metagenomic studies

Studies applying Nanopore sequencing to describe microbial diversity have conventionally applied an approach similar to that of previous studies, which were mostly Illumina-based, regardless of the fact that Nanopore generates full-length 16S sequences. With Nanopore, the full-length 16S rRNA gene is amplified by PCR using universal primers (27F and 1493R). The library is prepared by the addition of adapters to the amplicon sequences, and samples are sequenced directly with a flowcell mounted on the MinION device (Fig. 1c).

Authors have tried to standardize a different 16S-based amplicon barcoding protocol by using a two-step PCR protocol, with a first step to amplify the 16S rRNA gene and a second one to add the adapters for sequencing of the 16S amplicons [54,55]. Another strategy has been based on the use of an ONT 1D² chemistry library preparation in which both DNA strands are sequenced (similar to the paired-end sequencing of Illumina), improving the quality of the reads by sequencing both strands of the target DNA [56]. Although different strategies have been applied in published studies using Nanopore sequencing for 16S rRNA metabarcoding, the 16S barcoding Kit of Oxford Nanopore Technologies has been predominantly used, with satisfactory results [47–50].

As with sample preparation, the methodologies introduced to analyze Nanopore 16S amplicons have included a broad range of bioinformatic tools. Nevertheless, despite the different tools, the central process in all the published studies is the application of a strategy based on taxonomic assignment [50,49,51,53].

3.2. Taxonomic assignment using Nanopore 16S sequences

Compared with Illumina, there is a scarcity of bioinformatic tools and protocols designed specifically for the analysis of Nanopore 16S sequences. The most extensively used tool is the cloud-based data analysis service EPI2ME (ONT), which provides a number of workflows for end-to-end analysis of nanopore 16S data: 16S taxonomic classification, a barcoding protocol, and quality filtering of reads. For taxonomic assignment, FASTQ files are uploaded to the FASTQ 16S protocol of the EPI2ME platform, reads are filtered by quality, and taxonomy is then assigned using BLAST against the NCBI database, with a minimum horizontal coverage of 30% and a minimum accuracy of 77% as default parameters (ONT). However, this tool is not publicly available and only ONT customers can gain access to it through a web platform. Moreover, quality filters, adapter trimming, and alignment parameters such as identity and coverage of sequences are configured by default, and the user cannot modify more than the initial read quality parameters. Furthermore, the format of the final output with the taxonomic assignment results is not compatible with other tools for performing downstream analyses such as diversity and taxonomic differential abundance. To overcome these limitations of the EPI2ME software, it is necessary to define a different analytical pipeline that considers the other bioinformatic tools available.

Cusco [54] applied a mapping approach for taxonomic assignment using the tool Minimap, and was able to determine the taxonomic composition at the genus and species level for bacterial isolates, mock communities, and complex skin samples. However, the study suggested the need for a more accurate bioinformatic protocol to achieve more reliable results. Another important result of this research is that taxonomic accuracy can be improved by analyzing sequences longer than the 16S rRNA gene, such as the rrn
operon (16S rRNA-ITS-23S rRNA; ~4500 bp). Using Minimap2 [60], Kai et al. [58] reported species-level bacterial identification with more than 90% of reads correctly assigned to each species. A subsequent study carried out by Hardegen et al. [55] used a BLAST-based classification and concluded that their pipeline can be suitable for taxonomic assignment of 16S rRNA reads from Nanopore sequencing. Edwards et al. [57] used VSEARCH [61] for taxonomic assignment and reached a confidence level of 75% at the phylum and family level. A different approach was taken by Ma et al. [56], who carried out taxonomic classification using the RDP classifier [62], and reported, for pure cultures, an average annotation accuracy of 93.8% and 82.0% at the phylum and genus level, respectively. Mitsuhashi et al. [63] analyzed a mock community of pleural effusion from a patient with empyema using Centrifuge [64] and BLAST for taxonomic analysis, successfully identifying all the species present in the mock community with Centrifuge [64]. Turner et al. [59] described the microbiome of a new invasive nemertean species using Centrifuge [64] for taxonomic assignment, identifying 2054 species associated with the microbiome.

Considering all of the aforementioned studies, Centrifuge [64] and Minimap [60] have been the most frequently used taxonomic classifiers for Nanopore datasets [56,47,50,49,51]. Regarding the characteristics of both bioinformatic tools, Centrifuge [64] is capable of accurately identifying reads when using databases containing multiple highly similar reference genomes, such as different strains of a bacterial species. Centrifuge works by building a database of genomes in which unique segments of these genomes are identified to build an FM-index (a compressed data structure for full-text pattern searching). This FM-index can be used for efficient searches of sequenced reads against genome segments in a database. On the other hand, Minimap2 [60] is a general-purpose alignment program that maps long DNA sequences against reference genomes such as human, fungal, bacterial, or viral genomes. Minimap2 is more than 30 times faster than other long-read mapping tools or cDNA mapping tools and also possesses higher accuracy, surpassing most aligners specialized in a single type of alignment. Although both tools have been applied with success to the analysis of Nanopore data, Minimap was specifically developed for mapping long reads, while Centrifuge was conceived for a more general purpose (mapping against full genome databases) in metagenomic analyses. However, in terms of parameter

have proven to be the most suitable tools to work with Nanopore data, and they could be considered the best choices at present.

In addition, a second critical aspect to consider in taxonomic assignment is the composition of the database, which generally has a strong influence on the percentage of sequences correctly assigned to different taxonomic levels [69,70]. To date, there are few curated databases available for microbial identification; the most frequently used for 16S studies are SILVA [71], Greengenes [72], RDP [62], and NCBI [73]. The SILVA database contains taxonomic information for the domains Bacteria, Archaea, and Eukarya. It is based primarily on phylogenies for small subunit rRNAs (16S for prokaryotes and 18S for Eukarya) [70]. Its taxonomic hierarchy and ranks are constructed according to Bergey’s Taxonomic Outlines, the List of Prokaryotic Names with Standing in Nomenclature (LPSN), and manual curation [74]. Greengenes is the most popular and widely used database, since it is the default database in the QIIME pipeline (http://qiime.org/index.html). It provides bacterial and archaeal taxonomy based on phylogenetic trees inferred from chimera-free, consistent multiple sequence alignments, but it has not been updated since May 2013. The NCBI taxonomy contains the names of all organisms associated with submissions to the NCBI sequence databases. It is manually curated based on current systematic literature, and uses over 150 sources. It contains some duplicate names that represent different organisms. Each NCBI database node has a scientific name and may have some synonyms assigned to it. It is important to note that this has been the most used database in articles on the classification of MinION 16S sequences [63,57,65,59,58]. The RDP database is based on 16S rRNA sequences from Bacteria, Archaea, and Fungi (Eukarya). It contains 16S rRNA sequences available from the International Nucleotide Sequence Database Collaboration (INSDC) database. Another new database is EzBiocloud, a species-level resolution database made up of 61 700 species/phylotypes, including 13 132 species/phylotypes with validly published names, and 62 362 whole-genome assemblies that were identified taxonomically at the genus, species, and subspecies levels [75].

Some authors have evaluated the differences in taxonomic assignment using these databases [70] and showed that NCBI is the largest in terms of number of sequences, followed by SILVA, RDP and Greengenes, respectively. In addition, they found that SILVA shares the most taxonomic units with NCBI, and that green
393 setting and configuration, Centrifuge offers more variety of mod- genes is the less diverse data base. Moreover, only green genes 456
394 ules and versatility, which could result in a more reliable taxo- and NCBI could get taxonomic assignment to the species level rank, 457
395 nomic assignment. while SILVA allows only genus as the lowest rank. Importantly, 458
396 Other tools such as BLASTN, MEGABLAST and LASTZ [58,56] NCBI database is not curated for all the groups of microorganisms 459
397 have also applied for taxonomic assignment in metabarcoding and may contain duplicated copies of 16S sequences, which can 460
398 studies using Illumina sequencing. Nevertheless, it is important lead to a bias in taxonomic assignment by an overestimation 461
399 to highlight that due to the differences between Nanopore and Illu- because of the high number of some bacterial groups. An example 462
400 mina reads in terms of longer and poorer quality resulting from the of this is the high number of available sequences belonging to 463
401 presence of insertions and deletions on sequences, many of these pathogenic bacterial groups given by the NCBI repository. Con- 464
402 standardized bioinformatics tools and pipelines are not suitable trasting with clinical strains, sequences belonging to extreme envi- 465
403 to be used with Nanopore data. In this context, Magi et al [66,67] ronments still remain scarce in the NCBI database and may be 466
404 have made an assessment of alignment and mapping tools and underrepresented when a taxonomic assignment is carried out. 467
405 concluded that mapping or aligning Nanopore reads against a data- More detailed guidelines for the selection of the database is pro- 468
406 base is particularly challenging due to the size, high number and vided by Park & Won 2018 [74]. 469
407 non-uniform error profiles of these long sequences. This study also A final consideration for the selection of tools is the format for 470
408 found that mapping and alignment tools such as LAST, BWA, output data, since they cannot be compatible with other bioinfor- 471
409 BLASR, and MarginAlign, were inefficient to process Nanopore data matics tools applied for downstream analysis. This particularly 472
410 and the outcomes of these analyses were deeply influenced by the relates to those tools performing statistical tests, and generating 473
411 sequence lengths, since longer sequences contained more errors plots and comparative analyses of taxonomic profiles identified 474
412 [59,60,16,52]. Moreover, Centrifuge has been included as part of in samples. A detailed description of the different options and 475
413 the pipeline for the analysis of nanopore sequences in the new tool applications of the available tools for 16S metagenomic studies 476
414 MINDS [68]. Based on these studies, Centrifuge and Minimap2 using Nanopore data are summarized in Table 2. 477
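To make the FM-index idea mentioned for Centrifuge more concrete, the following is a minimal, uncompressed sketch of an FM-index with backward search. It is only an illustration of the data structure: the function names (`build_fm_index`, `find`) and the toy reference string are assumptions made for this example, and Centrifuge's actual implementation (compression, sampling of the suffix array, handling of many genomes) is considerably more elaborate.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, Occ table) for `text`."""
    text = text + "$"  # sentinel that sorts before A, C, G, T
    # Naive suffix array: fine for a short illustrative string only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # C[c]: number of characters in the text that are strictly smaller than c.
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, c_table, occ


def find(pattern, fm_index):
    """Return sorted start positions of `pattern` using backward search."""
    sa, c_table, occ = fm_index
    lo, hi = 0, len(sa)  # half-open interval of matching suffixes
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

For example, after `fm = build_fm_index("ACGTACGTTACG")`, calling `find("ACG", fm)` returns the three start positions of that segment. The pattern is consumed from its last character to its first, and each step narrows an interval of the suffix array, so the search cost depends on the pattern length rather than on the size of the indexed reference.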
Table 2
Different tools used to analyze Nanopore 16S data in metabarcoding studies.

| Analysis approach | Data processes included | Tools used for analysis | Taxonomic database | Reference |
|---|---|---|---|---|
| Profiling of bacterial communities | Basecalling, demultiplexing, adapter and barcode trimming, chimera removal, taxonomic assignment | Albacore v2.3.1, Porechop, Yacrd 0.3, Minimap, EPI2ME | NCBI and rrn database | [54] |
| In-field metagenome bacterial community analysis | Basecalling, demultiplexing, taxonomic assignment, diversity analysis | Albacore v1.10, SINTAX, usearch v10.0.240 | Ribosomal Database Project | [57] |
| Rapid bacterial pathogens identification | Basecalling, human reads removal, bacterial reads taxonomic assignment | Albacore 2.2.4, TanTan v13, Minimap2, R | GenomeSync database, NCBI database | [58] |
| Microbial monitoring of an anaerobic digestion system | Basecalling, demultiplexing, adapter trimming, taxonomic assignment | Metrichor, EPI2ME, poRe, Porechop, QIIME, BLAST | GreenGenes database | [55] |
| Microbiome characterization | Basecalling, OTU picking, taxonomy assignment | Metrichor v2.42.2, Poretools, QIIME 1.9, RDP classifier, BLASTn | GreenGenes database | [56] |
| Microbiome amplicon sequencing workflow | Basecalling, alignment, re-orientation of reads, de-novo clustering, chimera removal | Fast5-to-fastq, seqtk, INC-Seq, blastn, Graphmap, POA, chopSeq, nanoClust, R | No taxonomic assignment | [91] |
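Several of the workflows in Table 2 map reads against a reference database with Minimap/Minimap2 and then summarize the hits per taxon. The sketch below illustrates one possible way to turn Minimap2's tab-separated PAF output into a per-taxon count table that downstream statistical tools can read. It is not taken from any of the cited pipelines: the function names, the identity and coverage thresholds, and the `target_to_taxon` lookup are assumptions chosen for illustration. Only the twelve mandatory PAF columns (query name, query length, query start/end, strand, target name, target length, target start/end, matching residues, alignment block length, mapping quality) are relied upon.

```python
from collections import Counter

PAF_MANDATORY_FIELDS = 12  # a PAF record has at least 12 tab-separated columns


def best_hits(paf_lines, min_identity=0.85, min_query_cov=0.8):
    """Keep, per read, the hit with the most matching residues.

    The thresholds are illustrative assumptions, not recommended values.
    """
    best = {}
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < PAF_MANDATORY_FIELDS:
            continue  # skip malformed or empty lines
        read, qlen = cols[0], int(cols[1])
        qstart, qend = int(cols[2]), int(cols[3])
        target = cols[5]
        matches, aln_len = int(cols[9]), int(cols[10])
        if qlen == 0 or aln_len == 0:
            continue
        identity = matches / aln_len       # approximate, BLAST-like identity
        coverage = (qend - qstart) / qlen  # fraction of the read aligned
        if identity < min_identity or coverage < min_query_cov:
            continue
        if read not in best or matches > best[read][1]:
            best[read] = (target, matches)
    return {read: target for read, (target, _) in best.items()}


def taxon_counts(hits, target_to_taxon):
    """Count reads per taxon; targets missing from the lookup are 'unassigned'."""
    return Counter(target_to_taxon.get(t, "unassigned") for t in hits.values())
```

A call such as `taxon_counts(best_hits(open("reads_vs_db.paf")), lookup)` (with a hypothetical file name and lookup dictionary) yields counts that can be written out as a simple table. Keeping only one hit per read and filtering on identity and coverage is a deliberate simplification; with error-prone reads the thresholds strongly affect how many reads remain assigned.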
3.3. Constraints to move beyond taxonomic assignment with Nanopore sequencing data

Since most of the analytical tools for taxonomic assignment have been developed for Illumina data and cannot be used for Nanopore sequences, the potential benefits of using full-length 16S rRNA sequences have not been systematically explored. The deeper taxonomic resolution provided by the full 16S gene sequence can reach the genus and species level with higher specificity than other approaches [74–76]. This methodology has been applied with success in clinical and forensic settings and in the quality control of industrial processes, where many of the microorganisms to be identified are well represented in databases due to their medical/human relevance [34,67].

However, taxonomic assignment is not always the best approach in other ecological contexts where the microbial community has not been previously studied. In these circumstances, the most representative microorganisms living in these habitats may remain unexplored and consequently their genomic data are not present in databases, which makes taxonomic identification impossible for many of the reads. This situation is probably even more critical when working with Nanopore data, since databases are predominantly composed of fragments of the 16S rRNA gene and full-length sequences are the exception rather than the rule, which limits reliable taxonomic identification based on the full sequence of the gene. On the other hand, the presence of a large number of reads without taxonomic assignment has a direct impact on obtaining a realistic measure of the biological diversity in the sample, leading to an underestimation of the real number of species. In this context, and as described in section 2, to overcome these limitations and the bias induced by a direct taxonomic assignment of reads, approaches such as Operational Taxonomic Unit (OTU) picking and/or denoising pipelines are commonly used for 16S Illumina data analysis [78,79,80]. Both OTU picking and ASV analyses reduce the duplication and error of representative sequences and allow the analysis of bacterial groups without a database limitation, which allows for a more reliable taxonomic assignment, resulting in a more robust definition of microbial communities (Table 3).

These analyses need to be performed in order to execute a taxonomic assignment and diversity analysis (Fig. 3). As described previously, tools such as DADA2 and Deblur are the most commonly applied in Illumina sequencing pipelines. However, because of the particular characteristics of Nanopore 16S reads (length and quality), the use of DADA2, Deblur, or any other algorithm based on ASV detection has not yet been viable for Nanopore data. The number of errors (mainly insertions/deletions) typically introduced through Nanopore sequencing represents an extraordinary limitation to finding similarity between reads. Furthermore, the artificial divergence in sequences caused by the poor quality of reads, even when they come from a single organism, can result in each read being identified as a separate sequence variant, leading to an overestimation of bacterial diversity [78]. As a consequence, the analysis of Nanopore reads with inappropriate OTU clustering tools or with an ASV approach could provide a completely incorrect picture of the microbial diversity of the sample, showing a dataset with very divergent sequences.

Therefore, although the ASV approach is the most complete way to assess bacterial diversity, it is impracticable for Nanopore data analysis, and the only option available is the application of an OTU-based clustering approach. However, limitations similar to those identified for ASV can be found when the most popular clustering algorithms are applied [83], such as UCLUST [84], VSEARCH [61] or CD-HIT [85]. The use of the popular pipeline QIIME to analyze Nanopore 16S sequences was assessed in a recent study [56], which indicated that the tool failed at the OTU picking step, corroborating the aforementioned issue of applying tools designed for Illumina to Nanopore data. With closed- or open-reference OTU clustering, only a small fraction of the data would be clustered and most of the dataset would be composed of singletons, which causes an erroneous overestimation of the bacterial diversity in the samples.

As previously mentioned, read quality is one of the most important constraints for nanopore data analysis. Basecalling is the most decisive process for the improvement of sequence quality. Nanopore sequencing is based on the detection of changes in electric current produced by the passage of DNA strands through a nanopore. Each base ideally should have a specific current variation, called an event. Each event is summarized by the mean and variance of the current and by the event duration [86,57]. Translating these events into a DNA sequence is known as the basecalling process. The original ONT basecallers used Hidden Markov Models (HMM); however, new strategies based on machine learning are now applied in all modern nanopore basecallers, such as Guppy, DeepNano, and Chiron [86,87]. These machine learning-based basecallers use neural networks that can be trained with real sequencing data. The use of machine learning approaches has been shown to be effective for improving the quality of nanopore sequencing data and limiting the impact of base modifications, insertions, and deletions commonly present in raw data [88]. Therefore, the use of these new machine learning approaches on nanopore data has been crucial for improving sequence quality and in the short term will probably allow the necessary improvement of nanopore sequences to go beyond the taxonomic assignment of 16S sequences.

Table 3
Bioinformatic tools for 16S rRNA metabarcoding Nanopore data.

| Process | Tool | Input file | Programming languages | Available from | Reference |
|---|---|---|---|---|---|
| Basecalling | Albacore | fast5 | Python | https://nanoporetech.com/ | ONT |
| | Guppy | fast5 | Python | https://nanoporetech.com/ | ONT |
| | DeepNano | fast5 | Python | https://bitbucket.org/vboza/deepnano | [86] |
| | Chiron | fast5 | Python | https://github.com/haotianteng/Chiron | [87] |
| Sequencing report | NanoPlot | fastq, fasta, sequencing_summary (Albacore or Guppy basecaller) | Python | https://github.com/wdecoster/NanoPlot | [92] |
| | poRe | fastq, fasta | R | https://sourceforge.net/projects/rpore/files/ | [93] |
| | pauvre | fastq | | https://github.com/conchoecia/pauvre | GitHub |
| | poretools | fastq, fast5 | Python | https://github.com/arq5x/poretools | [94] |
| Demultiplexing | Albacore | fast5 | Python | https://nanoporetech.com/ | ONT |
| | qcat | fastq | Python | https://github.com/nanoporetech/qcat | GitHub |
| | Porechop | fastq, fasta | C++, Python | https://github.com/rrwick/Porechop | GitHub |
| Filtering and trimming | NanoFilt | fastq | Python | https://github.com/wdecoster/nanofilt | [92] |
| | Filtlong | fastq | C++, Python | https://github.com/rrwick/Filtlong | GitHub |
| | Porechop | fastq | C++, Python | https://github.com/rrwick/Porechop | GitHub |
| Taxonomic assignment | Minimap2 | fastq, fasta | C++, Python | https://github.com/lh3/minimap2 | [60] |
| | WIMP | fastq | Cloud-based | https://nanoporetech.com/ | ONT |
| | Centrifuge | fastq, fasta | g++ | https://ccb.jhu.edu/software/centrifuge | [64] |
| | LASTZ | fastq, fasta | g++, Python | https://github.com/lastz/lastz | GitHub |
| Clustering | NanoClust | USEARCH/VSEARCH format | Python | https://github.com/umerijaz/nanopore/blob/master/nanoCLUST.py | [91] |
| | CARNAC-LR | paf | C++, Python | https://github.com/kamimrcht/CARNAC-LR | [89] |
| Data exploration | Pavian | Kraken and MetaPhlAn formats | R | https://github.com/fbreitwieser/pavian | [95] |
| | PHINCH | biom | Cloud-based | https://github.com/PitchInteractiveInc/Phinch | [96] |
| | Krona | Krona format | – | https://github.com/marbl/Krona/wiki | [97] |
| | MEGAN6 | OTU table | – | http://ab.inf.uni-tuebingen.de/software/megan6/ | [98] |
| | MicrobiomeAnalyst | OTU table, taxonomy table | Cloud-based | https://www.microbiomeanalyst.ca/ | [99] |

A final and important point to be considered is the difference in the orientation of reads produced by Illumina and Nanopore sequencing technologies. With Illumina, read orientation is defined from the beginning of sequencing and therefore all sequences are in the same orientation, which greatly facilitates bioinformatic data analysis. This homogeneity in the sequencing data is essential for alignment and clustering because reads can be compared more easily. On the other hand, with the 1D sequencing chemistry of Nanopore, adapters can be ligated to one or both ends of the DNA template [78] and DNA strands are sequenced in random orientations. Consequently, after the basecalling process the dataset is composed of a mix of forward and reverse sequences that are not complementary to each other. Hence, it may be critical to incorporate an additional step to evaluate the orientation of reads prior to the analysis of Nanopore data in order to reach consistent results.

According to the points discussed in previous sections relating to the availability of tools and their applications for working with Nanopore sequences, a workflow for 16S rRNA data analysis is proposed in Fig. 3.

4. Summary and outlook

With the advent of modern technologies for sequencing, microbial ecology studies based on the analysis of the microbial 16S rRNA gene have become one of the most popular techniques in metabarcoding studies. Most of the studies conducted to date using Nanopore sequences report pipelines applied with a narrow scope, typically using a specific bioinformatic protocol to detect a particular pathogen or a target bacterial group or taxon, without considering the analysis of the whole microbial community present in the sample. However, most of the current aligners, clustering algorithms, and tools cannot process Nanopore data [83], and this remains a challenge to performing a more comprehensive analysis of Nanopore 16S rRNA data.

Due to the potential bias introduced by taxonomic assignment, OTU clustering may represent a more convenient alternative. In this regard, the new tools developed for transcriptomic de-novo clustering could represent an alternative to explore in the future [72,73]. As several transcriptomics-based studies have been carried out with Nanopore, a possible alternative would be to take the kind of tools used for de-novo clustering of all the transcripts originating from a single gene and apply the same strategy to group all the variants of the 16S gene in a sample. Moreover, some of these tools have been developed to deal with the particular features of Nanopore sequences and, therefore, can be used as a first approach to implementing a specific clustering tool for 16S sequences from Nanopore.

Finally, many challenges for data analysis have surfaced since the development of the new sequencing technologies. The correct use of available tools has contributed to extending the use of 16S data from Nanopore for a first evaluation of the microbial composition. For Nanopore, efforts have been primarily focused on designing tools for basecalling, demultiplexing, and taxonomic assignment, according to the demand of consumers and end-users of this technology. Certainly, we are still in the first stages of the genomic revolution and the future will bring new possibilities for the expansion of these technologies and development of a
new generation of powerful bioinformatic tools. The best parameters concerning identity, alignment, and database choice must also be evaluated for each dataset, in particular if identification at the species level is required. The 2019 release by ONT of the new version (R10) of the flowcell, with a new chemistry, will offer a substantial improvement in the quality and quantity of data, with a consensus accuracy reaching 99% and an output of 50 Gb. All these developments in Nanopore outputs will generate new challenges for bioinformatic analysis, but will also bring new opportunities to revolutionize microbial ecology studies.

Uncited references

[13,14,21,22,32,36,77,81,82,90,100,101,102].

CRediT authorship contribution statement

Andres Santos: Writing - original draft. Ronny van Aerle: Writing - review & editing. Leticia Barrientos: Writing - review & editing. Jaime Martinez-Urtaza: Writing - review & editing.

Acknowledgements

Andres Santos' work was supported by the grants CONICYT-Doctorado Nacional-2017-21171392; Universidad de La Frontera CD-FRO1204; Network for Extreme Environments Research (NXR17-0003); DI 19-0079 and Cefas Seedcorn.
We thank Judith Hoffman from Northern Light Translations for English revision of this review.

Please cite this article as: A. Santos, R. van Aerle, L. Barrientos et al., Computational methods for 16S metabarcoding studies using Nanopore sequencing data, Computational and Structural Biotechnology Journal, https://doi.org/10.1016/j.csbj.2020.01.005

References

[1] Zhou J, He Z, Yang Y, Deng Y, Tringe SG, Alvarez-Cohen L. High-throughput metagenomic technologies for complex microbial community analysis: open and closed formats. MBio 2015;6. https://doi.org/10.1128/mBio.02288-14.
[2] Levin SA. Fundamental questions in biology. PLoS Biol 2006;4:e300.
[3] Solden L, Lloyd K, Wrighton K. The bright side of microbial dark matter: Lessons learned from the uncultivated majority. Curr Opin Microbiol 2016. https://doi.org/10.1016/j.mib.2016.04.020.
[4] Dubnau D, Smith I, Morell P, Marmur J. Gene conservation in Bacillus species. I. Conserved genetic and nucleic acid base sequence homologies. Proc Natl Acad Sci 1965. https://doi.org/10.1073/pnas.54.2.491.
[5] Woese CR, Fox GE. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A 1977.
[6] Amit Roy SR. Molecular Markers in Phylogenetic Studies-A Review. J Phylogenetics Evol Biol 2014. https://doi.org/10.4172/2329-9002.1000131.
[7] Gutell RR, Gray MW, Schnare MN. A compilation of large subunit (23S and 23S-like) ribosomal RNA structures: 1993. Nucleic Acids Res 1993. https://doi.org/10.1093/nar/21.13.3055.
[8] Clarridge JE. Impact of 16S rRNA gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases. Clin Microbiol Rev 2004. https://doi.org/10.1128/CMR.17.4.840-862.2004.
[9] Gray MW, Sankoff D, Cedergren RJ. On the evolutionary descent of organisms and organelles: a global phylogeny based on a highly conserved structural core in small subunit ribosomal RNA. Nucleic Acids Res 1984. https://doi.org/10.1093/nar/12.14.5837.
[10] Poretsky R, Rodriguez-R LM, Luo C, Tsementzi D, Konstantinidis KT. Strengths and limitations of 16S rRNA gene amplicon sequencing in revealing temporal microbial community dynamics. PLoS ONE 2014. https://doi.org/10.1371/journal.pone.0093827.
[11] Janda JM, Abbott SL. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. J Clin Microbiol 2007. https://doi.org/10.1128/JCM.01228-07.
[12] Pace NR. A molecular view of microbial diversity and the biosphere. Science 1997;276:734–40.
[13] Eloe-Fadrosh EA, Ivanova NN, Woyke T, Kyrpides NC. Metagenomics uncovers gaps in amplicon-based detection of microbial diversity. Nat Microbiol 2016. https://doi.org/10.1038/nmicrobiol.2015.32.
[14] Batista-García RA, del Rayo Sánchez-Carbente M, Talia P, Jackson SA, O’Leary ND, Dobson ADW, et al. From lignocellulosic metagenomes to lignocellulolytic genes: trends, challenges and future prospects. Biofuels, Bioprod Biorefining 2016. https://doi.org/10.1002/bbb.1709.
[15] Bukin YS, Galachyants YP, Morozov IV, Bukin SV, Zakharenko AS, Zemskaya TI. The effect of 16s rRNA region choice on bacterial community metabarcoding results. Sci Data 2019. https://doi.org/10.1038/sdata.2019.7.
[16] Logares R, Sunagawa S, Salazar G, Cornejo-Castillo FM, Ferrera I, Sarmento H, et al. Metagenomic 16S rDNA Illumina tags are a powerful alternative to amplicon sequencing to explore diversity and structure of microbial communities. Environ Microbiol 2014. https://doi.org/10.1111/1462-2920.12250.
[17] Goodwin S, McPherson JD, McCombie WR. Coming of age: Ten years of next-generation sequencing technologies. Nat Rev Genet 2016. https://doi.org/10.1038/nrg.2016.49.
[18] A framework for human microbiome research. Nature 2012;486:215–21. https://doi.org/10.1038/nature11209.
[19] Gilbert JA, Jansson JK, Knight R. Earth microbiome project and global systems biology. MSystems 2018;3:e00217-17. https://doi.org/10.1128/mSystems.00217-17.
[20] Tighe S, Afshinnekoo E, Rock TM, McGrath K, Alexander N, McIntyre A, et al. Genomic methods and microbiological technologies for profiling novel and extreme environments for the extreme microbiome project (XMP). J Biomol Tech 2017;28:31–9. https://doi.org/10.7171/jbt.17-2801-004.
[21] Scholz MB, Lo CC, Chain PSG. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Curr Opin Biotechnol 2012. https://doi.org/10.1016/j.copbio.2011.11.013.
[22] Ju F, Zhang T. 16S rRNA gene high-throughput sequencing data mining of microbial diversity and interactions. Appl Microbiol Biotechnol 2015. https://doi.org/10.1007/s00253-015-6536-y.
[23] Simon C, Daniel R. Metagenomic analyses: past and future trends. Appl Environ Microbiol 2011. https://doi.org/10.1128/aem.02345-10.
[24] Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods 2010;7:335–6. https://doi.org/10.1038/nmeth.f.303.
[25] Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 2009. https://doi.org/10.1128/AEM.01541-09.
[26] McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 2013;8:e61217. https://doi.org/10.1371/journal.pone.0061217.
[27] Pushkarev D, Neff NF, Quake SR. Single-molecule sequencing of an individual human genome. Nat Biotechnol 2009. https://doi.org/10.1038/nbt.1561.
[28] Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, et al. Real-time DNA sequencing from single polymerase molecules. Science 2009. https://doi.org/10.1126/science.1162986.
[29] Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genom 2012. https://doi.org/10.1186/1471-2164-13-341.
[30] Mosher JJ, Bowman B, Bernberg EL, Shevchenko O, Kan J, Korlach J, et al. Improved performance of the PacBio SMRT technology for 16S rDNA sequencing. J Microbiol Methods 2014. https://doi.org/10.1016/j.mimet.2014.06.012.
[31] Myer PR, Kim MS, Freetly HC, Smith TPL. Evaluation of 16S rRNA amplicon sequencing using two next-generation sequencing technologies for phylogenetic analysis of the rumen bacterial community in steers. J Microbiol Methods 2016. https://doi.org/10.1016/j.mimet.2016.06.004.
[32] Pootakham W, Mhuantong W, Yoocha T, Putchim L, Sonthirod C, Naktang C, et al. High resolution profiling of coral-associated bacterial communities using full-length 16S rRNA sequence data from PacBio SMRT sequencing system. Sci Rep 2017. https://doi.org/10.1038/s41598-017-03139-4.
[33] Jain M, Olsen HE, Paten B, Akeson M. The Oxford Nanopore MinION: Delivery of nanopore sequencing to the genomics community. Genome Biol 2016. https://doi.org/10.1186/s13059-016-1103-0.
[34] Deamer D, Akeson M, Branton D. Three decades of nanopore sequencing. Nat Biotechnol 2016. https://doi.org/10.1038/nbt.3423.
[35] van Dijk EL, Jaszczyszyn Y, Naquin D, Thermes C. The Third Revolution in Sequencing Technology. Trends Genet 2018. https://doi.org/10.1016/j.tig.2018.05.008.
[36] Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, et al. The potential and challenges of nanopore sequencing. Nat Biotechnol 2008. https://doi.org/10.1038/nbt.1495.
[37] Feng Y, Zhang Y, Ying C, Wang D, Du C. Nanopore-based fourth-generation DNA sequencing technology. Genom Proteom Bioinf 2015. https://doi.org/10.1016/j.gpb.2015.01.009.
[38] Kono N, Arakawa K. Nanopore sequencing: review of potential applications in functional genomics. Dev Growth Differ 2019. https://doi.org/10.1111/dgd.12608.
[39] Jain M, Fiddes IT, Miga KH, Olsen HE, Paten B, Akeson M. Improved data analysis for the MinION nanopore sequencer. Nat Methods 2015. https://doi.org/10.1038/nmeth.3290.
[40] Schneider GF, Dekker C. DNA sequencing with nanopores. Nat Biotechnol 2012;30:326.
[41] Plesivkova D, Richards R, Harbison S. A review of the potential of the MinION™ single-molecule sequencing system for forensic applications. Wiley Interdiscip Rev Forensic Sci 2019;1:e1323. https://doi.org/10.1002/wfs2.1323.
[42] Bahram M, Anslan S, Hildebrand F, Bork P, Tedersoo L. Newly designed 16S rRNA metabarcoding primers amplify diverse and novel archaeal taxa from the environment. Environ Microbiol Rep 2018. https://doi.org/10.1111/1758-2229.12684.
780 [43] Walters WA, Caporaso JG, Lauber CL, Berg-Lyons D, Fierer N, Knight R. [68] Deshpande Reed, Sullivan Kerkhof, Beigel Wade. Offline next generation metagenomics 865
781 PrimerProspector: De novo design and taxonomic analysis of barcoded polymerase chain sequence analysis using MinION detection Software (MINDS). Genes (Basel) 2019. 866
782 reaction primers. Bioinformatics 2011. https://doi.org/ 10.1093/bioinformatics/btr087. https://doi.org/10.3390/genes10080578. 867
783 [69] Escobar-Zepeda A, Godoy-Lozano EE, Raggi L, Segovia L, Merino E, Gutiérrez-Rios 868
784 [44] Kerkhof LJ, Dillon KP, Häggblom MM, McGuinness LR. Profiling bacterial RM, et al. Analysis of sequencing strategies and tools for taxonomic annotation: defining 869
785 communities by MinION sequencing of ribosomal operons. Microbiome 2017;5:116. standards for progressive metagenomics. Sci Rep 2018. https://doi.org/10.1038/s41598- 870
786 https://doi.org/10.1186/s40168-017-0336-9. 018-30515-5. 871
787 [45] Pollock J, Glendinning L, Wisedchanwet T, Watson M. The madness of microbiome: [70] Balvocˇiute˙ M, Huson DH. SILVA, RDP, Greengenes, NCBI and OTT — how do these 872
788 attempting to find consensus ‘‘best practice” for 16S microbiome studies. Appl Environ taxonomies compare? BMC Genomics 2017;18:114. https://doi.org/ 10.1186/s12864- 873
789 Microbiol 2018;84. https://doi.org/ 10.1128/AEM.02627-17. 017-3501-4. 874
[46] Kuczynski J, Lauber CL, Walters WA, Parfrey LW, Clemente JC, Gevers D, et al. Experimental and analytical tools for studying the human microbiome. Nat Rev Genet 2012. https://doi.org/10.1038/nrg3129.
[47] Tedersoo L, Tooming-Klunderud A, Anslan S. PacBio metabarcoding of Fungi and other eukaryotes: errors, biases and perspectives. New Phytol 2018. https://doi.org/10.1111/nph.14776.
[48] Bhyan SB, Wee Y, Zhao M, Liu Y, Lu J, Li X. The bioinformatics tools for the genome assembly and analysis based on third-generation sequencing. Brief Funct Genomics 2018;18:1–12. https://doi.org/10.1093/bfgp/ely037.
[49] Tyler AD, Mataseje L, Urfano CJ, Schmidt L, Antonation KS, Mulvey MR, et al. Evaluation of Oxford Nanopore’s MinION sequencing device for microbial whole genome sequencing applications. Sci Rep 2018. https://doi.org/10.1038/s41598-018-29334-5.
[50] McNaughton AL, Roberts HE, Bonsall D, de Cesare M, Mokaya J, Lumley SF, et al. Illumina and Nanopore methods for whole genome sequencing of hepatitis B virus (HBV). Sci Rep 2019. https://doi.org/10.1038/s41598-019-43524-9.
[51] Prazsák I, Moldován N, Balázs Z, Tombácz D, Megyeri K, Szucs A, et al. Long-read sequencing uncovers a complex transcriptome topology in varicella zoster virus. BMC Genom 2018. https://doi.org/10.1186/s12864-018-5267-8.
[52] Jenjaroenpun P, Wongsurawat T, Pereira R, Patumcharoenpol P, Ussery DW, Nielsen J, et al. Complete genomic and transcriptional landscape analysis using third-generation sequencing: A case study of Saccharomyces cerevisiae CEN.PK113-7D. Nucleic Acids Res 2018. https://doi.org/10.1093/nar/gky014.
[53] Seki M, Katsumata E, Suzuki A, Sereewattanawoot S, Sakamoto Y, Mizushima-Sugano J, et al. Evaluation and application of RNA-Seq by MinION. DNA Res 2019. https://doi.org/10.1093/dnares/dsy038.
[54] Cusco A, Vines J, D’Andreano S, Riva F, Casellas J, Sanchez A, et al. Using MinION to characterize dog skin microbiota through full-length 16S rRNA gene sequencing approach. BioRxiv 2017.
[55] Hardegen J, Latorre-Perez A, Vilanova C, Gunther T, Porcar M, Luschnig O, et al. Methanogenic community shifts during the transition from sewage mono-digestion to co-digestion of grass biomass. Bioresour Technol 2018;265:275–81. https://doi.org/10.1016/j.biortech.2018.06.005.
[56] Ma X, Stachler E, Bibby K. Evaluation of Oxford Nanopore MinION Sequencing for 16S rRNA Microbiome Characterization. BioRxiv 2017:99960. https://doi.org/10.1101/099960.
[57] Edwards A, Debbonaire AR, Nicholls SM, Rassner SME, Sattler B, Cook JM, et al. In-field metagenome and 16S rRNA gene amplicon nanopore sequencing robustly characterize glacier microbiota. BioRxiv 2019:73965. https://doi.org/10.1101/073965.
[58] Kai S, Matsuo Y, Nakagawa S, Kryukov K, Matsukawa S, Tanaka H, et al. Rapid bacterial identification by direct PCR amplification of 16S rRNA genes using the MinION™ nanopore sequencer. FEBS Open Bio 2019;9:548–57. https://doi.org/10.1002/2211-5463.12590.
[59] Turner AD, Fenwick D, Powell A, Dhanji-Rapkova M, Ford C, Hatfield RG, et al. New invasive nemertean species (Cephalothrix simula) in England with high levels of tetrodotoxin and a microbiome linked to toxin metabolism. Mar Drugs 2018;16:452. https://doi.org/10.3390/md16110452.
[60] Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018;34:3094–100. https://doi.org/10.1093/bioinformatics/bty191.
[61] Rognes T, Flouri T, Nichols B, Quince C, Mahe F. VSEARCH: a versatile open source tool for metagenomics. PeerJ 2016;4:e2584. https://doi.org/10.7717/peerj.2584.
[62] Cole JR, Wang Q, Fish JA, Chai B, McGarrell DM, Sun Y, et al. Ribosomal database project: data and tools for high throughput rRNA analysis. Nucleic Acids Res 2014;42:D633–42. https://doi.org/10.1093/nar/gkt1244.
[63] Mitsuhashi S, Kryukov K, Nakagawa S, Takeuchi JS, Shiraishi Y, Asano K, et al. A portable system for rapid bacterial composition analysis using a nanopore-based sequencer and laptop computer. Sci Rep 2017;7:5657. https://doi.org/10.1038/s41598-017-05772-5.
[64] Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res 2016;26:1721–9. https://doi.org/10.1101/gr.210641.116.
[65] Greninger AL, Naccache SN, Federman S, Yu G, Mbala P, Bres V, et al. Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis. Genome Med 2015;7:99. https://doi.org/10.1186/s13073-015-0220-9.
[66] Magi A, Semeraro R, Mingrino A, Giusti B, D’Aurizio R. Nanopore sequencing data analysis: state of the art, applications and challenges. Brief Bioinform 2018;19:1256–72. https://doi.org/10.1093/bib/bbx062.
[67] Magi A, Giusti B, Tattini L. Characterization of MinION nanopore data for resequencing analyses. Brief Bioinform 2017. https://doi.org/10.1093/bib/bbw077.
[71] Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 2013;41:D590–6.
[72] McDonald D, Price MN, Goodrich J, Nawrocki EP, Desantis TZ, Probst A, et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 2012. https://doi.org/10.1038/ismej.2011.139.
[73] Federhen S. The NCBI taxonomy database. Nucleic Acids Res 2012. https://doi.org/10.1093/nar/gkr1178.
[74] Park S-C, Won S. Evaluation of 16S rRNA Databases for Taxonomic Assignments Using Mock Community. Genomics Inform 2018;16:e24. https://doi.org/10.5808/GI.2018.16.4.e24.
[75] Yoon SH, Ha SM, Kwon S, Lim J, Kim Y, Seo H, et al. Introducing EzBioCloud: A taxonomically united database of 16S rRNA gene sequences and whole-genome assemblies. Int J Syst Evol Microbiol 2017. https://doi.org/10.1099/ijsem.0.001755.
[76] Callahan BJ, Wong J, Heiner C, Oh S, Theriot CM, Gulati AS, et al. High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution. Nucleic Acids Res 2019. https://doi.org/10.1093/nar/gkz569.
[77] Benítez-Páez A, Portune KJ, Sanz Y. Species-level resolution of 16S rRNA gene amplicons sequenced through the MinION™ portable nanopore sequencer. GigaScience 2016. https://doi.org/10.1186/s13742-016-0111-z.
[78] Earl JP, Adappa ND, Krol J, Bhat AS, Balashov S, Ehrlich RL, et al. Species-level bacterial community profiling of the healthy sinonasal microbiome using Pacific Biosciences sequencing of full-length 16S rRNA genes. Microbiome 2018. https://doi.org/10.1186/s40168-018-0569-2.
[79] Besser J, Carleton HA, Gerner-Smidt P, Lindsey RL, Trees E. Next-generation sequencing technologies and their application to the study and control of bacterial infections. Clin Microbiol Infect 2018. https://doi.org/10.1016/j.cmi.2017.10.013.
[80] Rosen MJ, Callahan BJ, Fisher DS, Holmes SP. Denoising PCR-amplified metagenome data. BMC Bioinf 2012. https://doi.org/10.1186/1471-2105-13-283.
[81] Tikhonov M, Leach RW, Wingreen NS. Interpreting 16S metagenomic data without clustering to achieve sub-OTU resolution. ISME J 2015. https://doi.org/10.1038/ismej.2014.117.
[82] Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 2011;27:2194–200. https://doi.org/10.1093/bioinformatics/btr381.
[83] Sedlazeck FJ, Lee H, Darby CA, Schatz MC. Piercing the dark matter: Bioinformatics of long-range sequencing and mapping. Nat Rev Genet 2018. https://doi.org/10.1038/s41576-018-0003-4.
[84] Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010. https://doi.org/10.1093/bioinformatics/btq461.
[85] Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9. https://doi.org/10.1093/bioinformatics/btl158.
[86] Boza V, Brejova B, Vinar T. DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads. PLoS ONE 2017;12:e0178751. https://doi.org/10.1371/journal.pone.0178751.
[87] Hall MB, Cao MD, Duarte T, Teng H, Coin LJM, Wang S. Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning. Gigascience 2018;7. https://doi.org/10.1093/gigascience/giy037.
[88] Wick RR, Judd LM, Holt KE. Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol 2019. https://doi.org/10.1186/s13059-019-1727-y.
[89] Marchet C, Lecompte L, Da Silva C, Cruaud C, Aury J-M, Nicolas J, et al. De novo clustering of long reads by gene from transcriptomics data. Nucleic Acids Res 2018. https://doi.org/10.1093/nar/gky834.
[90] Sahlin K, Medvedev P. De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality-Value Based Algorithm. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), 2019. https://doi.org/10.1007/978-3-030-17083-7_14.
[91] Calus ST, Ijaz UZ, Pinto AJ. NanoAmpli-Seq: a workflow for amplicon sequencing for mixed microbial communities on the nanopore sequencing platform. Gigascience 2018;7. https://doi.org/10.1093/gigascience/giy140.
[92] De Coster W, D’Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 2018;34:2666–9.
[93] Santoyo-Lopez J, Risse J, Gharbi K, Thomson M, Blaxter M, Watson M, et al. poRe: an R package for the visualization and analysis of nanopore sequencing data. Bioinformatics 2014;31:114–5. https://doi.org/10.1093/bioinformatics/btu590.
[94] Loman NJ, Quinlan AR. Poretools: a toolkit for analyzing nanopore sequence data. Bioinformatics 2014;30:3399–401. https://doi.org/10.1093/bioinformatics/btu555.
[95] Breitwieser FP, Salzberg SL. Pavian: Interactive analysis of metagenomics data for microbiomics and pathogen identification. BioRxiv 2016.
[96] Bik HM. Phinch: An interactive, exploratory data visualization framework for –Omic datasets. BioRxiv 2014.
[97] Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinf 2011;12:385. https://doi.org/10.1186/1471-2105-12-385.
[98] Huson DH, Beier S, Flade I, Górska A, El-Hadidi M, Mitra S, et al. Community edition - interactive exploration and analysis of large-scale microbiome sequencing data. PLOS Comput Biol 2016;12:e1004957.
[99] Dhariwal A, Chong J, Habib S, King IL, Agellon LB, Xia J. MicrobiomeAnalyst: a web-based tool for comprehensive statistical, visual and meta-analysis of microbiome data. Nucleic Acids Res 2017;45:W180–8. https://doi.org/10.1093/nar/gkx295.
[100] Hsieh TC, Ma KH, Chao A. iNEXT: an R package for rarefaction and extrapolation of species diversity (Hill numbers). Methods Ecol Evol 2016;7:1451–6. https://doi.org/10.1111/2041-210X.12613.
[101] Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’Hara RB. Package vegan. R Packag Ver 2013.
[102] Krehenwinkel H, Pomerantz A, Henderson JB, Kennedy SR, Lim JY, Swamy V, et al. Nanopore sequencing of long ribosomal DNA amplicons enables portable and simple biodiversity assessments with high phylogenetic resolution across broad taxonomic scale. GigaScience 2019. https://doi.org/10.1093/gigascience/giz006.
