Shokralla Et Al. 2015


OPEN

Massively parallel multiplex DNA sequencing for specimen identification using an Illumina MiSeq platform

Shadi Shokralla1, Teresita M. Porter2, Joel F. Gibson1, Rafal Dobosz1, Daniel H. Janzen3, Winnie Hallwachs3, G. Brian Golding2 & Mehrdad Hajibabaei1

1 Department of Integrative Biology and Biodiversity Institute of Ontario, University of Guelph, 50 Stone Road East, Guelph, ON, Canada N1G 2W1, 2 Department of Biology, McMaster University, 1280 Main Street West, Hamilton, ON, Canada L8S 4K1, 3 Department of Biology, University of Pennsylvania, Philadelphia, PA, USA 19104.

SUBJECT AREAS: GENETIC MARKERS, BIODIVERSITY

Received 7 January 2015
Accepted 16 March 2015
Published 17 April 2015

Correspondence and requests for materials should be addressed to S.S. (sshokral@uoguelph.ca)

Genetic information is a valuable component of biosystematics, especially specimen identification through the use of species-specific DNA barcodes. Although many genomics applications have shifted to High-Throughput Sequencing (HTS) or Next-Generation Sequencing (NGS) technologies, sample identification (e.g., via DNA barcoding) is still most often done with Sanger sequencing. Here, we present a scalable double dual-indexing approach using an Illumina MiSeq platform to sequence DNA barcode markers. We achieved 97.3% success by using half of an Illumina MiSeq flowcell to obtain 658 base pairs of the cytochrome c oxidase I DNA barcode in 1,010 specimens from eleven orders of arthropods. Our approach recovers a greater proportion of DNA barcode sequences from individuals than does conventional Sanger sequencing, while at the same time reducing both per specimen costs and labor time by nearly 80%. In addition, the use of HTS allows the recovery of multiple sequences per specimen, for deeper analysis of genetic variation in target gene regions.

The use of DNA sequences in biosystematics has revolutionized our understanding of biodiversity from
elucidating deep branches of the Tree of Life to exploring species boundaries and population-level variations
in communities and ecosystems. For example, short, standardized species-specific DNA sequences - DNA
barcodes - have been demonstrated to work well for specimen identification in systematics1,2, ecological research3,
biodiversity inventories4,5, museum collection research6, and forensic applications7. Target gene regions have
been established as DNA barcodes for each kingdom of life (e.g., cytochrome c oxidase subunit I (COI) for
animals8, nuclear internal transcribed spacer (ITS) for fungi9, and rbcL and matK chloroplast genes for plants10). A
number of initiatives currently seek to build and curate public DNA barcode databases for the purpose of
recording, counting, and identifying global biodiversity11–14. In order for DNA sequence libraries to be of maximal
value, they must be built so as to represent major amounts of the described and undescribed global diversity15–17.
Type specimens for each species - holotypes - are, by definition, the reference for a given species. It has been
suggested that DNA barcode data for these holotype specimens are necessary for databases18. Many of these type
specimens are contained in museum collections, are very old, and are generally unavailable for standard genomic
data gathering6. Special protocols are needed to access the massive potential sources of data presently stored in
natural history collections.
Another major source of specimens for DNA barcode-based studies is mixed environmental samples. These
samples come from Malaise traps19, freshwater and marine benthos20,21, meiofauna22, and marine zooplankton23.
From the arctic to the neotropics, such mixed environmental samples have revealed a high degree of species-level
genetic diversity5. The recovery of DNA sequence data from both museum specimens and bulk-collected environmental
samples will greatly facilitate the construction of DNA barcode libraries.
Conventional PCR amplification followed by dideoxy chain-termination sequencing (also known as Sanger
sequencing24) has been used for the production of nearly all of the existing content of public DNA barcode
libraries. Cost limitations of Sanger sequencing per specimen, however, severely restrict its ability to be scaled up
to deal with millions of specimens requiring DNA barcoding. In addition, Sanger sequencing requires relatively
high concentrations of high quality DNA template in order to be successful25. Even when successful, the process
produces only a single sequencing signal pattern, or electropherogram, of a maximum of 1,500 base pairs per
individual. This single sequence can be the product of co-amplification of other DNA templates present with the

SCIENTIFIC REPORTS | 5 : 9687 | DOI: 10.1038/srep09687 1


www.nature.com/scientificreports

[Figure 1: three bar-chart panels. (A) Sanger sequencing success and (B) Illumina MiSeq sequencing success, each plotting number of specimens against plate number (1–11); (C) number of specimens against number of unique sequences produced (0, 1, 2, 3+).]

Figure 1 | Results of both Sanger and Illumina MiSeq sequencing of 1,010 individual arthropods from a single Malaise trap sample. (A) Overall success
of generating COI DNA sequences via Sanger sequencing for each of eleven 96-well specimen plates. (B) Overall success of generating COI DNA sequences
via Illumina MiSeq sequencing for each of eleven plates. For (A) and (B), number of individuals per plate producing a COI sequence are shaded dark
below, with unsuccessful individuals above. (C) Number of unique COI DNA sequences produced via Illumina MiSeq sequencing for each individual.
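
Panel (C) rests on a counting rule given in the Results: after de-replication, a sequence counts as unique and abundant for an individual if it makes up more than 10% of that individual's full-length sequences. The following is a minimal sketch of that rule, not the authors' pipeline. It assumes exact-match de-replication (the paper clusters at 99% similarity with USEARCH), and the function name and toy reads are hypothetical.

```python
from collections import Counter


def abundant_unique_sequences(reads, min_fraction=0.10):
    """Return (sequence, count) pairs whose share of the reads exceeds min_fraction.

    Assumption: identical strings are treated as one unique sequence;
    no similarity clustering is attempted here.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() orders the unique sequences from most to least abundant
    return [(seq, n) for seq, n in counts.most_common() if n / total > min_fraction]


# Hypothetical reads for one individual: a dominant barcode, a second
# abundant sequence, and a rare variant that falls below the 10% cutoff.
toy_reads = ["ACGTAC"] * 80 + ["ACGTTC"] * 15 + ["TTGTAC"] * 5
print(abundant_unique_sequences(toy_reads))
```

Under this rule the toy individual would be tallied in the "2" bar of panel (C); an individual with no reads at all would fall in the "0" bar.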

target individual (e.g., intra-sample contamination, Wolbachia infection, gut contents) and may not represent the ‘true’ genetic marker of the target individual26. This case is common when attempting to recover DNA sequence data from individuals isolated from bulk mixed samples (e.g., Malaise traps, benthic samples, soil meiofauna, marine zoo- and phytoplankton). These circumstances can introduce intra-sample contamination and it is often necessary to use vector-based cloning or gel excision to be able to recover the target gene sequence. These methods are time consuming and labor-intensive. Another challenge in recovering DNA sequence data from an individual is specimen body size for some groups. The meiofauna represent organisms from all branches of Animalia that fall roughly between 50 µm and 0.5 mm in size20. Due to their size, meiofaunal organisms cannot be reliably tissue subsampled or, in some cases, isolated individually. This restriction has severely hampered efforts to generate genetic marker libraries for these important groups.

We have developed a new multiplexing approach to recovering DNA barcode sequences from individuals that addresses the problem of isolating individuals from mixed environmental samples. By utilizing the high throughput sequencing power of Illumina MiSeq, a platform with a relatively small size and lower operating cost, we generate a large number of full-length (658 bp) DNA barcode sequences from a diverse group of organisms in a single sequencing run. We sequenced and assembled two smaller overlapping fragments of the COI barcode region to overcome the primer specificity challenges for the recovery of DNA barcodes. Increased sequencing depth per specimen allowed for the generation of multiple DNA sequences per specimen. Bioinformatic analysis of these sequences determined the ‘true’ barcode for an individual and distinguished it from likely intra-sample contamination, Wolbachia, pseudogenes, or other intrusions. While we used COI sequences, this method is adaptable to any chosen genetic marker. It is also scalable to thousands of individuals per sequencing run. Not only did this method recover a greater proportion of DNA sequence data from individuals than did conventional Sanger sequencing, it also did so at a much lower cost per specimen.

Results

A total of 1,010 individual arthropods were isolated from a single Malaise trap sample from Area de Conservación Guanacaste, northwestern Costa Rica. Each individual was isolated, morphologically identified to order, and tissue subsampled. Tissue subsamples (i.e., a leg from each individual) were separated into eleven 96-well tissue plates and DNA extracted.

The standard 5′ end of the COI region was amplified for each individual DNA template using the primers LCO1490 and HCO2198 (ref. 27). These amplicons were sequenced via standard Sanger protocols. A total of 537 individuals (53.2%) produced a full-length (>500 bp) sequence via Sanger sequencing (Fig. 1A). Sanger sequencing success ranged from 12.0% (plate 9) up to 91.3% (plate 4). A total of 983 individuals (97.3%) produced at least one full-length sequence via Illumina MiSeq sequencing (Fig. 1B). The same region of COI was amplified for all individual DNA templates in two smaller, overlapping fragments using Ill_LCO1490 x Ill_C_R and Ill_B_F x Ill_HCO2198 primer sets respectively. The two fragments overlap by 82 bp. All generated amplicons were dual indexed with unique 5-mer multiple identifiers (MIDs) from both directions. The generated amplicons were pooled in groups and re-dual indexed and sequenced on half of a single Illumina MiSeq flowcell using a V3 MiSeq sequencing kit (300 bp × 2). A total of 18,873,718 Illumina paired-end reads were filtered for quality and length. Across the eleven 96-well plates, a total of 10,480,349 raw FC fragment reads were paired-end sequenced (mean = 952,759 reads per plate) and a total of 8,393,369 raw BR fragment reads were sequenced (mean = 763,034 reads per plate). For each of the eleven plates, the raw paired-end reads for the FC fragment and, separately for the BR fragment, were merged with a minimum overlap of 25 bp. A total of 9,652,825 paired FC reads (mean = 877,530 paired reads per plate) and a total of 6,020,424 paired BR reads (mean = 547,311 paired reads per plate) were retained for further processing. After MID sorting and primer trimming, putative chimeric sequences were removed along with identical duplicate sequences using a 99% sequence similarity cutoff. The two fragments of each individual were paired, requiring a minimum of 80 bp overlap; a maximum of 0.02 (2%) mismatches were allowed in the overlap region. An average of 5,868 (range 5,166 – 6,577) full-length sequences were produced for each individual. Following de-replication of identical sequences, the number of unique, abundant sequences (>10% of total sequences per individual) recovered for each individual ranged from zero to six. Illumina MiSeq sequencing success ranged from 92.2% (plate 10) up to 100% (plates 2, 4, and 8). A total of 794 individuals (78.6%) produced exactly one unique full-length assembled COI sequence via Illumina MiSeq sequencing (Fig. 1C).

All sequences produced by both Sanger and Illumina MiSeq sequencing were identified via BLAST28 comparison to public COI databases. Each top hit BLAST result for each sequence for each individual was then compared to morphological identification (Fig. 2). A total of 509 individuals (50.4%) produced a DNA sequence matching morphological identification via Sanger sequencing. The number of non-matching sequences was 28 (2.8%), with the remaining 473 individuals (46.8%) producing no Sanger sequence at all. The percentage of matching Sanger sequences differed between


[Figure 2: two stacked bar-chart panels, (A) and (B), showing for each arthropod order and for all orders combined the percentage (0–100%) and number of specimens in three categories: top hit in GenBank matches morphology; no sequence; top hit in GenBank does not match morphology.]

Figure 2 | Number and percentage of 1,010 individual arthropod specimens producing a COI DNA sequence that matches morphological identification
based on BLAST comparison to public DNA barcode databases. Panel (A) Sanger-generated barcodes. Panel (B) Illumina-generated barcodes.
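
The three categories in Figure 2 can be expressed as a simple tally. This is a minimal sketch under stated assumptions: each specimen is reduced to a pair of its morphological order and the order of its top BLAST hit, with `None` when no sequence was produced. The function names and toy records are hypothetical and do not come from the authors' pipeline.

```python
from collections import Counter

CATEGORIES = ("match", "no sequence", "mismatch")


def classify(morph_order, top_hit_order):
    """Assign one specimen to a Figure 2 category."""
    if top_hit_order is None:
        return "no sequence"
    return "match" if top_hit_order == morph_order else "mismatch"


def tally(records):
    """Return {category: (count, percent of all specimens)}."""
    if not records:
        raise ValueError("no specimens to tally")
    counts = Counter(classify(morph, hit) for morph, hit in records)
    total = len(records)
    return {c: (counts[c], round(100 * counts[c] / total, 1)) for c in CATEGORIES}


# Hypothetical specimens: (order from morphology, order of top BLAST hit)
toy_records = [
    ("Diptera", "Diptera"),
    ("Diptera", None),
    ("Coleoptera", "Diptera"),
    ("Lepidoptera", "Lepidoptera"),
]
print(tally(toy_records))
```

Running the same tally per order, rather than over all specimens, would give the individual bars of each panel.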


Figure 3 | Neighbor-joining diagram of 1,211 COI sequences produced from Illumina MiSeq sequencing of 1,010 individual arthropods. Distance
measurement is calculated in number of base substitutions per site based on the Kimura 2-parameter method. Sequences originating from individuals
morphologically identified as Coleoptera (blue), Psocoptera (red), and Trombidiformes (green) are indicated. Distinction is also made between
sequences that correctly matched morphology based on a BLAST comparison to public COI databases (outlined), and those that did not match
morphology (filled in). Individuals producing a single sequence are depicted as circles, whereas multiple sequences from the same individuals are depicted
with triangles.
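
The Kimura 2-parameter distance named in the caption has a closed form, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of compared sites that differ by a transition and by a transversion. The following is a minimal sketch, assuming two pre-aligned sequences of equal length and skipping any site with a gap or ambiguous base. The function name is hypothetical, and the text here does not name the software used to compute the distances.

```python
import math

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance in substitutions per site."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P formula")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition in ten sites: the corrected distance is slightly above 0.1
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

The correction grows with divergence, so closely related sequences have K2P distances almost equal to their raw proportion of differing sites.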

arthropod orders, ranging from 1.8% for Trombidiformes to 62.4% for Hymenoptera, 63.2% for Diptera and 87.5% for Lepidoptera (Fig. 2A).

A total of 757 individuals (75.0%) produced a COI sequence matching morphological identification via Illumina MiSeq sequencing. The number of non-matching sequences was 225 (22.3%), with the remaining 27 individuals (2.7%) producing no Illumina MiSeq sequence at all. The percentage of matching Illumina MiSeq sequences differed between arthropod orders, ranging from 0.0% for Trombidiformes to 92.9% for Hymenoptera, 93.5% for Diptera and 96.9% for Lepidoptera (Fig. 2B).

Individuals from the three arthropod orders with the lowest percentage of matching Illumina MiSeq sequences to morphology were selected for further analysis. Coleoptera, Psocoptera, and Trombidiformes had the highest percentages of non-matching Illumina MiSeq sequences when compared to the morphological identification (38.3%, 72.9%, and 98.2% respectively) (Fig. 2). All unique sequences produced by Illumina MiSeq (n = 1,211) were used for a Neighbor-Joining analysis based on pairwise distance (Fig. 3). Sequences recovered from Coleoptera, Psocoptera, and Trombidiformes were labeled as either matching or non-matching. Distinction was also made between individuals producing a single Illumina MiSeq sequence and individuals producing multiple sequences. For sequences recovered from individuals identified morphologically as Coleoptera, all but eight were contained within a single cluster including all matching sequences. The same case was true for Psocoptera, with only six sequences excluded, and Trombidiformes, with only one sequence excluded.

Sequences derived from individuals of Coleoptera, Psocoptera, and Trombidiformes that were contained within the correct order cluster but had a BLAST match to an incorrect order represent accurate DNA sequences generated via Illumina MiSeq sequencing


[Figure 4: scatter plot of pairwise distance (y-axis, 0 to 0.3, with a dashed line at 0.02) for each individual (x-axis, 100 to 500).]
Figure 4 | Pairwise distances between COI DNA sequences generated by Sanger-sequencing and Illumina MiSeq sequencing for 521 individual
arthropods. Circles represent the first Illumina generated cluster, most similar to the Sanger, with other symbols representing second, third, and fourth
Illumina sequences generated from the same individual. The area below the dashed line represents all Illumina sequences sharing at least 98% sequence
similarity with a corresponding Sanger sequence from the same individual.
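
The dashed line in Figure 4 amounts to a threshold on uncorrected pairwise distance: an Illumina sequence falls below the line when it differs from the Sanger sequence at no more than 2% of compared sites. The following is a minimal sketch of that check, assuming pre-aligned sequences of equal length; the function names and toy sequences are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"  # ignore gaps and ambiguity codes
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def share_within(distances, threshold=0.02):
    """Fraction of distances at or below the threshold (>= 98% similarity)."""
    if not distances:
        raise ValueError("no distances given")
    return sum(d <= threshold for d in distances) / len(distances)


# Hypothetical 100 bp Sanger sequence and three Illumina counterparts
sanger = "ACGTACGTAC" * 10
illumina = [sanger, "T" + sanger[1:], "TTTTT" + sanger[5:]]
distances = [p_distance(sanger, seq) for seq in illumina]
print(distances, share_within(distances))
```

Applied to the counts reported in the text, 463 of 521 individuals below the line is 463/521, or about 89%.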

that do not have a similar match within public COI databases. These sequences represent individuals for whom there is no close match in existing public COI databases and that could not be sequenced by Sanger sequencing. By using a similarity-based clustering approach it is possible to determine that most of the ‘failures’ of Illumina MiSeq sequencing were likely to be correct COI sequences. The revised Illumina MiSeq sequencing success rate for Coleoptera, Psocoptera, and Trombidiformes could be recalculated as 95.1%, 93.8%, and 98.2% respectively.

To investigate the accuracy of the Illumina barcoding approach as compared to Sanger sequencing, pairwise distances between Sanger and Illumina sequences generated from the same individual were calculated (Fig. 4). Of the 521 individuals for which both Illumina and Sanger sequences were produced, 429 (82%) produced Sanger and Illumina sequences with no sequence difference. A total of 463 (89%) individuals produced Sanger and Illumina sequences with less than 2% sequence difference.

To explore the sequencing depth of the Illumina MiSeq approach, all generated sequences from individuals of the two arthropod orders represented by the greatest number of individuals (Diptera n = 231; Hymenoptera n = 226), regardless of sequencing abundance, were recovered and analyzed. All sequences that were generated for each individual for the two COI segments were paired, dereplicated, and identified via BLAST comparison to public COI databases. Eleven individuals of Diptera and nineteen individuals of Hymenoptera generated at least one additional sequence that was identified as Wolbachia sp. (Proteobacteria: Rickettsiales: Anaplasmataceae).

Discussion

It has been demonstrated that when DNA barcode libraries are more complete at the species level, the frequency of correct assignment of novel DNA sequences to upper taxonomic levels increases29,30. Large-scale efforts to recover DNA sequence data from fresh and archival specimens have shown some level of success (e.g., 50–86%) of potential DNA barcodes recovered31,32, but require a substantial amount of repeated sequencing effort.

A high Sanger sequencing failure rate is not unusual for large-scale DNA barcoding projects26. This is presumably due to insufficient amplification primer specificity, co-amplification of non-target amplicons, or the presence of competing sequence information (e.g., heteroplasmy and endosymbiotic bacteria) within individuals. Depending on the importance of the samples, some failures could be dealt with by using alternative PCR primers or changing the conditions of PCR prior to Sanger sequencing. In our present research, the low frequency of taxonomic assignment for COI sequences in some arthropod groups is likely due to an underrepresentation of Costa Rican specimens in publicly available DNA barcode libraries19,26.

Our method, employing the Illumina MiSeq sequencing platform to sequence individuals in parallel and to repetitively sequence from one individual in parallel, was able to recover DNA sequences from over 97% of specimens in a single attempt. This emphasizes the sensitivity of Illumina MiSeq sequencing in recovering DNA barcodes from amplicons of low quality and/or quantity that cannot be equaled with Sanger sequencing. For 89% of these individuals, the Illumina sequences recovered share over 98% sequence similarity with the Sanger sequences recovered from the same individual. For the other 11% of individuals, it cannot be assumed that the Sanger sequence is ‘correct’ and the Illumina sequence ‘incorrect.’ These individuals possibly represent instances in which Illumina sequencing was able to recover an accurate sequence, whereas Sanger sequencing did not. The Illumina-generated barcode sequence of each individual is the product of over one thousand forward and reverse sequences of the first fragment and over one thousand forward and reverse sequences of the second fragment followed by assembling a contig of both fragment clusters. Conversely, the Sanger-generated sequence is the product of a single forward and a single reverse sequence. Sequencing error or sequence interpretation error can be detected and filtered out when thousands of sequences are considered, but not when only a single sequence is present.

In cases of multiple sequences being recovered from a single specimen, two different similarity-based assessments were used to distinguish the ‘true’ DNA barcode from intra-sample contamination (Fig. 1C and Fig. 3). In addition to public database comparisons, sequence similarity-based confirmation of recovered sequences may be necessary for some groups of organisms. This is especially true for groups like those arthropods for which there is low coverage in public DNA databases19.

The use of an HTS approach for building sequencing libraries allows for deep-sequencing to recover low-abundance sequences within each individual. These additional sequences can include heteroplasmic copies of the target gene and intracellular parasitic bacteria (e.g., Wolbachia), if present33.

We recommend a new workflow for generating DNA barcode sequences (Fig. 5A). Morphological identification of specimens is optional within the workflow and could be completed at a later time for confirmatory purposes. The method is adaptable to all organisms (e.g., plants, animals, fungi, bacteria) and all genetic markers (e.g., COI, ITS, rbcL, 16S, 18S). We calculated the cost and time investments in DNA sequence generation using Sanger sequencing compared to our new method (Fig. 5B). The new method represents a 27% reduction in total time and 78% reduction in hands-on time in addition to a 79% reduction in laboratory costs. This cost reduction will increase with projected advances in HTS technology. The presented workflow also allows research laboratories to employ a single HTS platform for both metabarcoding of bulk environmental samples and the generation of barcodes for individual specimens.


[Figure 5, panel A: workflow. Bulk environmental sample → Sort/tubes → Identify morphologically → Subsample individuals → Dual indexing → Illumina MiSeq sequencing → Bioinformatic analysis, followed by two branches. Deep sequencing: recover all possible sequences; compare to public databases. Barcoding library: recover most abundant sequences; compare to morphology.]

[Figure 5, panel B: process/labor time and approximate cost.]

Sanger sequencing            Hands-on time   Hands-off time
PCR amplification            5 hr            10 hr
E-gel                        5 hr            1 hr
Sequencing reaction          5 hr            10 hr
Sequencing cleanup           5 hr            0 hr
Sanger sequencing            2 hr            10 hr
Manual sequence editing      10 hr           0 hr
Total time                   32 hr           31 hr
Approximate cost (chemistry and consumables): $7 / specimen

Illumina sequencing          Hands-on time   Hands-off time
PCR/dual indexing            5 hr            10 hr
Illumina sequencing          1 hr            27 hr
Bioinformatic analysis       1 hr            2 hr
Total time                   7 hr            39 hr
Approximate cost (chemistry and consumables): $1.5 / specimen

Figure 5 | Workflow and cost and time analysis for the generation of DNA sequence data from multiple specimens using Illumina MiSeq sequencing.
(A) The recommended new workflow. (B) A cost and time analysis of the new workflow versus conventional Sanger sequencing for ~1,000 individuals.
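
The percentage reductions quoted in the Discussion (27% total time, 78% hands-on time, 79% cost) can be re-derived from the totals in panel (B). The following is a small worked check using only the figure's totals; the variable and function names are hypothetical.

```python
def percent_reduction(old, new):
    """Percentage by which `new` is lower than `old`."""
    return 100 * (old - new) / old


# Totals from Figure 5B (hours, and US dollars per specimen)
sanger = {"hands_on_hr": 32, "hands_off_hr": 31, "cost": 7.0}
illumina = {"hands_on_hr": 7, "hands_off_hr": 39, "cost": 1.5}

total_time = percent_reduction(
    sanger["hands_on_hr"] + sanger["hands_off_hr"],      # 63 hr
    illumina["hands_on_hr"] + illumina["hands_off_hr"],  # 46 hr
)
hands_on = percent_reduction(sanger["hands_on_hr"], illumina["hands_on_hr"])
cost = percent_reduction(sanger["cost"], illumina["cost"])

print(f"total time: {total_time:.1f}%")
print(f"hands-on time: {hands_on:.1f}%")
print(f"cost: {cost:.1f}%")
```

Most of the saving comes from hands-on steps: hands-off time is higher for the Illumina workflow (39 hr versus 31 hr) because of the 27 hr sequencing run.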

Methods

The Malaise sample was collected at Bosque Humedo, Area de Conservación Guanacaste (latitude 10.85145; longitude -85.60801; altitude 290 m; date January 24–31, 2011). The sample was collected directly into 95% ethanol, and frozen at −20 °C until thawed and processed in September 2011.

Tissue subsampled plates were DNA extracted using a Nucleospin Tissue kit (Macherey-Nagel Inc., Bethlehem, PA, USA) according to manufacturer’s protocols. The standard 5′ end of the COI region was amplified using the primers LCO1490 and HCO2198 (ref. 27). Each PCR amplification contained 2 µL DNA template, 17.5 µL molecular biology grade water, 2.5 µL 10X reaction buffer, 1 µL 50X MgCl2 (50 mM), 0.5 µL dNTPs mix (10 mM), 0.5 µL forward primer (10 mM), 0.5 µL reverse primer (10 mM), and 0.5 µL Invitrogen Platinum Taq polymerase (5 U/µL) in a total volume of 25 µL. PCR conditions were 95 °C for 5 minutes; 35 cycles of 94 °C for 40 seconds, 51 °C for 1 minute, and 72 °C for 30 seconds; and 72 °C for 5 minutes. Amplicon sequences were obtained using an ABI 3730XL sequencer (Applied Biosystems) and the sequencing traces were edited and assembled using CodonCode Aligner v 3.7.1.1 (CodonCode). The same region of COI was amplified for all individual DNA templates in two smaller, overlapping fragments (FC and BR) using Ill_LCO1490 x Ill_C_R (5′-GGIGGRTAIACIGTTCAICC-3′) and Ill_B_F (5′-CCIGAYATRGCITTYCCICG-3′) x Ill_HCO2198 primer sets respectively. The two fragments overlap by 82 bp. The above mentioned amplification regime was used with a modification in the annealing temperature (48 °C for FC and 46 °C for BR). All amplifications were completed on a Mastercycler ep gradient S (Eppendorf, Mississauga, ON, Canada). A negative control reaction with no DNA template was included in all experiments. The generated amplicons were dual indexed with unique 5-mer multiple identifiers (MIDs) from both directions. The designed MIDs include 40 different 5-mer identifiers (AAGCT, ATTGC, AGATC, AGCAT, TTCAG, TGATC, TCAAG, TGAGC, CAATG, CATTG, CTTGA, CTGAA, ATGCA, AGCTT, TGCAA, TGCCA, TCATG, CATGA, CTGAT, CATGC for FC fragment and ATGCT, ATGCC, AGCTG, AGCTC, TGCAT, TGCAG, TCAGA, TCAGG, CAGAT, CCTGA, CTCAG, CTGCA, ATCAG, AGCCT, ATCTG, TCAGC, TCTGA, TCCAG, CAGCT, CTGAG for BR fragment). The 22 generated amplicon plates were re-dual indexed and pooled into a single tube and sequenced on half of a MiSeq flowcell using a V3 MiSeq sequencing kit (300 × 2) (FC-131-1002 and MS-102-3001).

For all eleven plates, a total of 18,873,718 Illumina paired-end reads were filtered for quality and length. For each plate, the raw paired-end reads for the FC fragment and, separately for the BR fragment, were merged with SEQPREP software (https://github.com/jstjohn/SeqPrep) requiring a minimum Phred quality score of 20 and a minimum overlap of 25 bp. A total of 9,652,825 paired FC reads (mean 877,530 paired reads per plate) and a total of 6,020,424 paired BR reads (mean 547,311 paired reads per plate) were retained for further processing. Paired FC and paired BR reads were quality trimmed using CUTADAPT v1.4.1 requiring a minimum length of 300 bp and a maximum length of 400 bp for the FC fragment and a minimum length of 400 bp and a maximum length of 500 bp for the BR fragment34. A bioinformatic pipeline was created using Perl to dereplicate quality trimmed reads using CD-HIT v4.6 with the ‘cd-hit-est’ algorithm, and perform chimera filtering using USEARCH v6.0.307 with the ‘de novo UCHIME’ algorithm35–37. At each step, cluster sizes were retained, singletons were retained, and only putatively non-chimeric reads were retained for further processing. A semi-automated bioinformatic pipeline was created using Perl to process the putatively non-chimeric FC and BR reads for each specimen and remove the associated tag and primers from each fragment. USEARCH with the UCLUST algorithm was used to de-replicate and cluster the remaining sequences using a 99% sequence similarity cutoff. A mapping file of tags was created and used to map FC and BR sequence clusters from each 96-well plate. A semi-automated bioinformatic pipeline was created using Perl to compare FC and BR fragment sequence clusters for each specimen using BLAST (blastn, megablast) requiring a minimum 98% sequence identity for each high-scoring segment pair (HSP), a minimum HSP length of 25 bp, with no more than 2 alignment mismatches. For BLAST results that met these criteria, each full-length FC and BR fragment was paired using SEQPREP requiring a minimum of 80 bp overlap and a maximum of 0.02 mismatches allowed in the overlap region.

More details of the dual-indexing, amplification, sequencing and post sequencing bioinformatic processing are available by request from the corresponding author.

1. Bertrand, C. et al. Mitochondrial and nuclear phylogenetic analysis with Sanger and next-generation sequencing shows that, in Área de Conservación Guanacaste, northwestern Costa Rica, the skipper butterfly named Urbanus belli (family Hesperiidae) comprises three morphologically cryptic species. BMC Evol. Biol. 14, 153 (2014).
2. Chacón, I. A., Janzen, D. H., Hallwachs, W., Sullivan, J. B. & Hajibabaei, M. Cryptic species within cryptic moths: new species of Dunama Schaus (Notodontidae, Nystaleinae) in Costa Rica. ZooKeys 264, 11–45 (2013).


3. Decaëns, T., Porco, D., Rougerie, R., Brown, G. G. & James, S. W. Potential of DNA barcoding for earthworm research in taxonomy and ecology. Appl. Soil Ecol. 65, 35–42 (2013).
4. Janzen, D. H. et al. Wedding biodiversity inventory of a large and complex Lepidoptera fauna with DNA barcoding. Philos. T. R. Soc. B 360, 1835–1845 (2005).
5. Janzen, D. H. et al. Integration of DNA barcoding into an ongoing inventory of complex tropical biodiversity. Mol. Ecol. Res. 9 (Suppl. s1), 1–26 (2009).
6. Van Houdt, J. K., Breman, F. C., Virgilio, M. & De Meyer, M. Recovering full DNA barcodes from natural history collections of Tephritid fruitflies (Tephritidae, Diptera) using mini barcodes. Mol. Ecol. Res. 10, 459–465 (2010).
7. Wallace, L. J. et al. DNA barcodes for everyday life: routine authentication of Natural Health Products. Food Res. Int. 49, 446–452 (2012).
8. Hebert, P. D. N., Cywinska, A., Ball, S. L. & de Waard, J. R. Biological identifications through DNA barcodes. P. Roy. Soc. B – Biol. Sci. 270, 313–321 (2003).
9. Schoch, C. L. et al. Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. P. Natl. Acad. Sci. USA 109, 6241–6246 (2012).
10. CBOL Plant Working Group. A DNA barcode for land plants. P. Natl. Acad. Sci. USA 106, 12794–12797 (2009).
11. Maidak, B. L. et al. The Ribosomal Database Project. Nucleic Acids Res. 22, 3485–3487 (1994).
12. Savolainen, V., Cowan, R. S., Vogler, A. P., Roderick, G. K. & Lane, R. Towards writing the encyclopedia of life: an introduction to DNA barcoding. Philos. T. R. Soc. B 360, 1805–1811 (2005).
13. Hajibabaei, M., Singer, G. A., Clare, E. L. & Hebert, P. D. N. Design and applicability of DNA arrays and DNA barcodes in biodiversity monitoring. BMC Biol. 5, 24 (2007).
14. Ratnasingham, S. & Hebert, P. D. N. BOLD: the barcode of life data system (www.barcodinglife.org). Mol. Ecol. Notes 7, 355–364 (2007).
15. Meyer, C. P. & Paulay, G. DNA barcoding: error rates based on comprehensive sampling. PLoS Biol. 3, e422 (2005).
16. DeWalt, R. E. DNA barcoding: a taxonomic point of view. J. N. Am. Benthol. Soc. 30, 174–181 (2011).
17. Kvist, S. Barcoding in the dark?: A critical view of the sufficiency of zoological DNA barcoding databases and a plea for broader integration of taxonomic knowledge. Mol. Phylogenet. Evol. 69, 39–45 (2013).
18. Kvist, S., Oceguera-Figueroa, A., Siddall, M. E. & Erséus, C. Barcoding, types and the Hirudo files: using information content to critically evaluate the identity of DNA barcodes. Mitochondr. DNA 21, 198–205 (2010).
19. Gibson, J. et al. Simultaneous assessment of the macrobiome and microbiome in a bulk sample of tropical arthropods through DNA metasystematics. P. Natl. Acad. Sci. USA 111, 8007–8012 (2014).
27. Folmer, O., Black, M., Hoeh, W., Lutz, R. & Vrijenhoek, R. DNA primers for amplification of mitochondrial cytochrome c oxidase subunit I from diverse metazoan invertebrates. Mol. Marine Biol. Biotech. 3, 294–299 (1994).
28. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
29. Ekrem, T., Willassen, E. & Stur, E. A comprehensive DNA sequence library is essential for identification with DNA barcodes. Mol. Phylogenet. Evol. 43, 530–542 (2007).
30. Costa, F. O. et al. A ranking system for reference libraries of DNA barcodes: application to marine fish species from Portugal. PLoS One 7, e35858 (2012).
31. Hernández-Triana, L. M. et al. Recovery of DNA barcodes from blackfly museum specimens (Diptera: Simuliidae) using primer sets that target a variety of sequence lengths. Mol. Ecol. Res. 14, 508–518 (2014).
32. Hebert, P. D. N. et al. A DNA ‘barcode blitz’: rapid digitization and sequencing of a natural history collection. PLoS One 8, e68535 (2013).
33. Smith, M. A. et al. Wolbachia and DNA barcoding insects: patterns, potential, and problems. PLoS One 7, e36514 (2012).
34. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17.1, 10–12 (2011).
35. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
36. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
37. Edgar, R. C., Haas, B. J., Clemente, J. C., Quince, C. & Knight, R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27, 2194–2200 (2011).

Acknowledgments
This project was funded by the Government of Canada through Genome Canada and the Ontario Genomics Institute through the Biomonitoring 2.0 project (OGI-050) to M.H., an NSF grant DEB 0515699 to D.H.J. and the JRS Biodiversity Foundation and the Wege Foundation of Grand Rapids, Michigan, to the Guanacaste Dry Forest Conservation Fund. J.F.G. is also funded by an NSERC Postdoctoral Fellowship. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We are grateful to Area de Conservación Guanacaste for protecting the forest habitat that we sampled.

Author contributions
S.S. and M.H. conceived and designed the experiments. D.H.J. and W.H. collected all specimens examined. S.S. performed Sanger sequencing and Illumina MiSeq sequencing.
20. Hajibabaei, M., Shokralla, S., Zhou, X., Singer, G. A. C. & Baird, D. J. S.S., J.F.G., T.M.P., B.G., R.D. and M.H. analyzed sequence data. S.S., J.F.G., D.H.J., W.H.
Environmental barcoding: a next-generation sequencing approach for and M.H. wrote and edited the final manuscript.
biomonitoring applications using river benthos. PLoS One 6, e17497 (2011).
21. Ranasinghe, J. A., Stein, E. D., Miller, P. E. & Weisberg, S. B. Performance of two
southern California benthic community condition indices using species Additional information
abundance and presence-only data: relevance to DNA barcoding. PLoS One 7, Accession codes: All Sanger and Illumina generated sequences have been deposited in
e40875 (2012). GenBank (accession nos. KP843909-KP844445) and DRYAD (doi: 10.5061/dryad.j897m).
22. Fonseca, V. G. et al. Second-generation environmental sequencing unmasks Competing financial interests: The authors declare no competing financial interests.
marine metazoan biodiversity. Nature Comm. 1, 98 (2010).
23. Laforest, B. J. et al. Insights into biodiversity sampling strategies for freshwater How to cite this article: Shokralla, S. et al. Massively parallel multiplex DNA sequencing for
microinvertebrate faunas through bioblitz campaigns and DNA barcoding. BMC specimen identification using an Illumina MiSeq platform. Sci. Rep. 5, 9687; DOI:10.1038/
Ecol. 13, 13 (2013). srep09687 (2015).
24. Sanger, F., Nicklen, S. & Coulson, A. R. DNA sequencing with chain-terminating
inhibitors. P. Natl. Acad. Sci. USA 74, 5463–5467 (1977). This work is licensed under a Creative Commons Attribution 4.0 International
25. Polz, M. F. & Cavanaugh, C. M. Bias in template-to-product ratios in License. The images or other third party material in this article are included in the
multitemplate PCR. Appl. Environ. Microbiol. 64, 3724–3730 (1998). article’s Creative Commons license, unless indicated otherwise in the credit line; if
26. Shokralla, S. et al. Next-generation DNA barcoding: using next-generation the material is not included under the Creative Commons license, users will need
sequencing to enhance and accelerate DNA barcode capture from single to obtain permission from the license holder in order to reproduce the material. To
specimens. Mol. Ecol. Res. 14, 892–901 (2014). view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

