Academia.eduAcademia.edu

Minimus: a fast, lightweight genome assembler

2007, BMC Bioinformatics

Background: Genome assemblers have grown very large and complex in response to the need for algorithms to handle the challenges of large whole-genome sequencing projects. Many of the most common uses of assemblers, however, are best served by a simpler type of assembler that requires fewer software components, uses less memory, and is far easier to install and run.

BMC Bioinformatics BioMed Central Open Access Software Minimus: a fast, lightweight genome assembler Daniel D Sommer1, Arthur L Delcher1, Steven L Salzberg1,2 and Mihai Pop*1,2 Address: 1Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA and 2Computer Science Department, University of Maryland, College Park, MD 20742, USA Email: Daniel D Sommer - [email protected]; Arthur L Delcher - [email protected]; Steven L Salzberg - [email protected]; Mihai Pop* - [email protected] * Corresponding author Published: 26 February 2007 BMC Bioinformatics 2007, 8:64 doi:10.1186/1471-2105-8-64 Received: 6 October 2006 Accepted: 26 February 2007 This article is available from: http://www.biomedcentral.com/1471-2105/8/64 © 2007 Sommer et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: Genome assemblers have grown very large and complex in response to the need for algorithms to handle the challenges of large whole-genome sequencing projects. Many of the most common uses of assemblers, however, are best served by a simpler type of assembler that requires fewer software components, uses less memory, and is far easier to install and run. Results: We have developed the Minimus assembler to address these issues, and tested it on a range of assembly problems. We show that Minimus performs well on several small assembly tasks, including the assembly of viral genomes, individual genes, and BAC clones. In addition, we evaluate Minimus' performance in assembling bacterial genomes in order to assess its suitability as a component of a larger assembly pipeline. We show that, unlike other software currently used for these tasks, Minimus produces significantly fewer assembly errors, at the cost of generating a more fragmented assembly. Conclusion: We find that for small genomes and other small assembly tasks, Minimus is faster and far more flexible than existing tools. Due to its small size and modular design Minimus is perfectly suited to be a component of complex assembly pipelines. Minimus is released as an open-source software project and the code is available as part of the AMOS project at Sourceforge. Background With the advent of whole-genome shotgun (WGS) sequencing in the mid-1990s, the genomics community had an urgent need for software that could process tens of thousands of individual sequence "reads" and assemble those into the genome from which they had come. The first generation of assemblers, including TIGR Assembler [1], phrap [2], and CAP3 [3], were able to assemble smallto medium-sized bacterial genomes, often requiring several weeks of computer time on the fastest computers then available. As sequencing technology improved, ever larger projects were attempted with the WGS method, and it became clear that new methods were needed. For the 130 million base pair (Mbp) genome of the fruit fly Drosophila melanogaster, an entirely new assembler was developed [4], which incorporated many new ideas about efficient memory usage and sophisticated repeat processing. The Celera Assembler (CelAsm) was also the first algorithm to use mate pair information to any serious degree: taking advantage of the fact that most reads in a WGS project are generated in pairs, that system used the expected distance between reads in a pair to impose many useful constraints Page 1 of 11 (page number not for citation purposes) BMC Bioinformatics 2007, 8:64 http://www.biomedcentral.com/1471-2105/8/64 on the overall assembly. Other large-scale WGS assemblers followed, including Arachne [5,6], which was used to assemble the 2.6 billion base pair (Gbp) mouse genome [7], Phusion [8], Atlas [9], and JAZZ [10]. these sequences are generated, the final step is to assemble the gap. This requires that the newly generated sequences, often spanning just a few kilobases or even less, be assembled together with the two surrounding contigs. As these systems have scaled up to meet the needs of very large WGS projects, they have grown in size and complexity, so that today, only a few very sophisticated bioinformatics groups have the expertise needed to install and run them. Like many large systems, these assemblers are relatively brittle, meaning that they often crash if the data does not conform to fairly rigid specifications. However, because they produce far superior results to the first generation of assemblers, the leading genome centers have focused their efforts on these large assemblers to the exclusion of other approaches. Large-scale assemblers such as CelAsm and Arachne can be used for this task, but this presents several problems. First, the scale of these programs means that simply loading them into memory can take longer than the execution time of the assembly itself. Second, the laboratory teams filling gaps typically use graphical tools to manage gaps, and configuring these tools to call a very large external program is impractical if not impossible. Third, and perhaps most telling, the cleverness of these WGS assemblers is a hindrance for gap closure, because the data do not conform to the characteristics of a typical shotgun process. The depth of coverage of finishing reads often differs from that in the surrounding areas thereby confusing the statistical repeat detection mechanisms present in large-scale assemblers, and preventing a correct assembly of the gap. Therefore an assembler for gaps will do better by using a simple, straight-forward algorithm, focused on a specific region of the genome. Finally, these assemblers cannot be easily modified to address the specific issues raised by specialized finishing procedures, especially as new finishing techniques are continuously being developed. For example, high-throughput finishing experiments often use transposons to sample a problematic region, resulting in paired reads that are facing away from each other (the sequencing proceeds away from the transposon). Such constraints cannot be easily incorporated in existing assemblers which are hard-coded to assume paired reads are facing inwards, towards the middle of the corresponding shotgun fragment. Flexible tools like Minimus and AMOS provide the potential for incorporating such information through add-on modules. Meanwhile, a host of new genome sequencing applications has arisen that place different demands on assembly algorithms. Although large-scale sequencing has pushed assembly technology in productive directions, small-scale sequencing efforts have proliferated as well. Our group recognized the growing need for an assembler that could assemble a handful of sequencing reads with a minimum of overhead (both computational and human), and as a result we have developed Minimus, a fast, "lightweight" assembler that addresses these needs. Before describing the algorithm and our results, we will describe several of these motivating applications. Gap closure Since the very first bacterial genome, Haemophilus influenza [11], was sequenced, we and our colleagues at The Institute for Genomic Research (TIGR) have been developing methods for closing the gaps in a draft genome. The initial assembly of a WGS project normally produces a large collection of contiguous pieces of DNA (contigs) that are separated by gaps. Improvements in sequencing and assembly technology have yielded fewer gaps per megabase in recent years, but nonetheless, the increased scale of sequencing has meant that large centers have many more gaps to fill. One unintended by-product of this trend is that many genomes today are left in "draft" form: the initial assembly is the only assembly, and the published genome consists of hundreds or thousands of unordered contigs. Fortunately, many genomes, especially those of the greatest scientific interest, are still being finished, which means that all gaps need to be closed. Gap closure consists of running additional sequencing reactions that fill in the gap between two adjacent contigs. If the gap is filled with repetitive sequence (which is often the case), then "closure" teams may go to great lengths to clone and sequence small DNA fragments that correctly span the gap. Once Gene assembly Another important use of small-scale assembly takes advantage of the rapidly growing Trace Archive at NCBI [12], a public repository of all the raw data from many large sequencing projects. Because it takes months and sometimes years before the final, assembled sequence from a genome project is released, scientists use the BLAST search function at the Trace Archive to find reads matching a gene of interest. If the gene is contained in the Trace Archive data, then a search will return anywhere from a handful to a few hundred sequences. These need to be assembled together to produce a better picture of the genomic region containing the gene. Once again, the scientist needs a small, less finicky assembler for this purpose. Page 2 of 11 (page number not for citation purposes) BMC Bioinformatics 2007, 8:64 Small genomes Although most sequencing capacity is taken up by the largest genome projects, the number of small genomes being sequenced easily outstrips – in number of species and strains – the number of large genomes. Ironically, some of the very clever and complicated ideas that make CelAsm, Arachne, and other assemblers work for large genomes make them less than ideal for these small genomes. Viruses are a good example: they typically have genomes ranging from 5–50 kilobases, and they contain relatively little repetitive DNA. Thus there is no need to characterize the repeat content, and a simple assembler that ignores the issues of large-scale WGS projects will produce a perfectly correct assembly more quickly. For example, the Influenza Genome Sequencing Project, which uses an RT-PCR strategy rather than WGS, has assembled over 1000 influenza genomes using Minimus [13], with savings coming from not having to address special formatting requirements to prepare the data and from not having to maintain a large assembly software package. Implementation Implementation details of Minimus The Minimus assembler was built in a modular fashion from software modules available within the AMOS assembly package [25] and is released as one of the components of this package. AMOS is an open-source software package that provides researchers with a collection of modules and software libraries that are useful in the development of genome assembly and analysis software. A full description of the AMOS package is beyond the scope of this paper and will be published elsewhere (M.Pop, manuscript in preparation). Minimus consists of the combination of three AMOS modules, following the traditional overlap-layout-consensus paradigm [26]. These modules interact with each other through a central AMOS data-structure (called a bank) as shown in Figure 1. The three modules are: 1. hash-overlap – a sequence overlapper that uses minimizers [27] to increase speed and decrease memory usage. 2. tigger – a unitigger, i.e. tool that identifies clusters of reads which can be uniquely assembled based on algorithms developed by Myers [28,29]; in graph theoretic terms a unitigger identifies maximal interval subgraphs of the overlap graph. 3. make-consensus – a progressive multiple alignment program that refines the read layout generated by the unitigger to build a precise multiple alignment of these reads. Note that sequence quality values are only used during the generation of the multiple alignment consensus (step 3). http://www.biomedcentral.com/1471-2105/8/64 Other assemblers, such as phrap, use the quality values as an integral component of the assembly algorithm. We found that, due to the high quality of data produced by modern sequencing instruments, the explicit consideration of quality values during the overlap and unitigging steps is unnecessary. Instead we only use the quality data to trim the poor quality flanks of each read (see below under Sequence trimming), and to compute the consensus (and associated quality values) for the multiple alignment of co-assembled reads. An execution of Minimus consists of the following stages, described in detail below. Input stage The shotgun reads are loaded into the AMOS bank. The inputs are presented as an AMOS message file, whose format is modeled on the format used by Celera Assembler [4]. Virtually any existing format for representing shotgun data can be easily converted to this message format with the help of conversion tools distributed with the AMOS package. Overlap stage The hash-overlap program is used to compute all pairwise alignments between the reads provided in the input. Unitigger stage The tigger module constructs a graph representation of the set of overlaps determined in the overlap stage. The overlap graph contains a node for each shotgun read, and an edge connects two nodes if the corresponding reads overlap. The unitigger then uses several reduction steps to simplify this graph, and generate a set of unitigs, based on algorithms originally developed by Myers [28,29]. Briefly, these reduction steps are: 1. Removal of containment edges. Reads completely contained within other reads in the input are removed from the graph. 2. Transitive reduction. For any set of three reads (A, B, and C), if the overlap between A and C can be inferred from the overlaps between reads A and B, and B and C, this overlap (i.e. the edge corresponding to this overlap) is removed from the graph. 3. Unique-join collapsing. Every simple path in the graph (paths that contain no branches, i.e. all the nodes have inand out-degrees equal to 1) are collapsed into a single vertex. Each such vertex represents an individual unitig. Consensus stage The final stage of Minimus constructs the full multiple alignment of the reads aligned within each unitig, using as Page 3 of 11 (page number not for citation purposes) BMC Bioinformatics 2007, 8:64 http://www.biomedcentral.com/1471-2105/8/64 Figure 1 of Minimus pipeline Overview Overview of Minimus pipeline. Several independent modules of the AMOS package (shown as ovals) interact through the AMOS API to a central data-structure (called a Bank). The order of execution of the individual modules is shown by the arrow. Note that the inputs and outputs to minimus follow the AMOS file format (AMOS message files). The AMOS package provides converters between this file format and virtually all commonly used formats for representing sequence data and genome assemblies. Page 4 of 11 (page number not for citation purposes) BMC Bioinformatics 2007, 8:64 a guide the approximate placement of the reads inferred from the overlap information. Sequence trimming The criteria used for trimming the vector sequence and the poor quality flanks of shotgun reads vary significantly depending on the source of the data and on the protocols employed during sequencing. In designing Minimus we, thus, opted to perform the trimming of the data with external software tools that can be customized to the specific characteristics of the data. For the examples described in this paper we followed two different approaches for sequence trimming: 1. For data where we had confidence that the Trace Archive clipping coordinates were correct (i.e. the two bacterial genomes) we simply used the coordinates provided to us. 2. For the other data-sets (zebrafish gene and mouse BACs) we followed the protocol described at [30], specifically we used the program Lucy [31] for quality trimming, followed by a k-mer based vector trimming protocol. Note that while phrap performs some trimming based on quality values, in order to ensure consistent trimming of the data, we provided phrap with sequences already trimmed according to the protocol described above. Extraction of GPC3 homologues from zebrafish shotgun data To extract the set of zebrafish shotgun reads that map to the human GPC3 gene, we built an NCBI Blast database containing the high-quality region of the zebrafish reads (obtained by removing the sequencing vector and the poor quality regions). We then aligned the protein sequence of the human GPC3 gene using tblastn with an E-value cutoff of 0.01. All reads matching GPC3 under these extremely relaxed criteria were then provided to Minimus for assembly. Results To demonstrate the capabilities of Minimus we present its application to the assembly of several small data-sets: influenza A virus isolates, individual genes, and BAC clones. We compare the performance of Minimus to that of phrap [2], the "small assembler" most commonly used for such small assembly tasks. We also used Minimus to assemble two bacterial genomes, Brucella suis, and Staphylococcus aureus, to illustrate its potential use as one of the components of a complex assembly pipeline. Genome assemblers such as Atlas [9], developed at the Human Genome Sequencing Center at the Baylor College of Medicine, and Phusion [8], developed at the Sanger Center, represent such assembly pipelines. Both assemblers use a http://www.biomedcentral.com/1471-2105/8/64 hierarchical approach to partition the reads into small sets during an initial clustering step, then assemble each of the clusters with the phrap assembler. Before describing our results we would like to emphasize the fact that the comparisons to phrap provided below are inherently skewed due to the fact that phrap and Minimus were designed to solve different problems. These comparisons are relevant, however, because phrap has been widely applied to assembly tasks that fall outside the scope of the original intended use for this program. We will demonstrate that Minimus provides scientists with a better tool for small assembly tasks, be it the assembly of viral genomes or individual genes, or as a component in a larger assembly pipeline such as Atlas or Phusion. The high stringency of the algorithms employed by Minimus obviates the need for the complex modules commonly used (e.g., the RPphrap module of Phusion [8]) in such assembly pipelines to correct the errors introduced by phrap. In addition, the flexibility provided by Minimus' well defined interfaces and open-source license, allow scientists to adapt and extend our software as needed by their specific projects. Such enhancements are virtually impossible with phrap due to the restrictive license and code release model. Assembly of influenza A virus isolates Assembling the influenza A virus is an ideal application for Minimus due to the small size of the virus. The influenza A sequencing project, currently underway at TIGR [13], has been using Minimus to assemble the genomes of more than 1400 individual isolates of the influenza virus. The sequencing pipeline at TIGR generates approximately 200 sequencing reads for each viral isolate, providing approximately 4-fold coverage of the 8 segments composing the flu genome. The assembly of the influenza genome is performed in a hierarchical manner, building a collection of contigs using Minimus with high stringency settings, then improving this assembly during two additional passes that combine Minimus with quality trimming software. In approximately 95% of the cases (J. Sitz, personal communication), this hierarchical process results in complete reconstructions of each of the segments, these data forming the substrate for genome annotation and for other subsequent analyses. The whole assembly process, including the time needed to access the database used to store the reads and the resulting assemblies, takes approximately 4 minutes. The actual time used by Minimus for assembling the data is approximately 2 seconds/segment during each of the three passes. The shotgun reads, and the assemblies produced by Minimus are made freely available to the scientific community by submission to the NCBI Trace and Assembly Archives [14]. Page 5 of 11 (page number not for citation purposes) BMC Bioinformatics 2007, 8:64 Assembly of individual genes One of the applications that initially drove the development of Minimus is the assembly of an individual gene from reads "fished" out of a shotgun dataset by alignment to a homologous gene from a related organism. This application is particularly relevant to the study of large eukaryotic genomes that are being sequenced but for which no assembly has yet been made available to the scientific community. While sequencing is a highly automated process, the assembly of large genomes is a timeconsuming activity that requires extensive manual intervention, particularly in the case of large, highly repetitive genomes, or genomes with highly divergent homologous chromosomes. Thus, it is not uncommon for the raw shotgun data to be deposited in the Trace Archive months, and sometimes years, before an assembly of a genome is made available, even in a draft form. This situation makes it difficult for scientists to ask questions such as "does this organism being sequenced have a homologue of gene X?", or "how many copies of gene Y are present in this genome?" Such questions are often difficult to answer even if a draft assembly is available, as evidenced, for example, by the absence of chromosome Y-linked genes in an early draft of Drosophila pseudoobscura; in that case, investigators found the genes of interest by directly examining the underlying shotgun data [15]. To highlight the application of Minimus to assembling individual genes directly extracted from the shotgun data, we attempted to assemble the zebrafish (Danio rerio) homologues to the human glypican-3 (GPC3) gene. The GPC3 gene is highly expressed during development and has been implicated in a variety of cancers as well as in the Simpson-Golabi-Behmel overgrowth syndrome (see, e.g., [16-19]). We chose this combination of organisms due to the large evolutionary distance between human and zebrafish, as well as the fluid nature of the draft assembly of the zebrafish genome (currently at version 6 and still being actively improved). We extracted 175 Danio rerio shotgun reads (from among the 24,961,699 reads publicly available at the NCBI Trace Archive) that could be mapped to the sequence of the human GPC3 protein (see Methods). We then assembled these reads using Minimus, resulting in 16 contigs, representing individual exons of the zebrafish GPC3 homologue. To ascertain the quality of the reconstruction, we mapped the individual contigs to the human gene. The overview of the alignments is shown in Figure 2 (top), indicating that approximately half of the human GPC3 gene is covered by high quality matches to four contigs generated by Minimus. Interestingly, these four contigs do not share any significant similarity at the nucleotide level, indicating the presence in zebrafish of at least four homologues to the human GPC3 gene, not unexpected given http://www.biomedcentral.com/1471-2105/8/64 that in human GPC3 is part of a larger gene family. This result could, however, not be immediately inferred from the Zebrafish Information Network (ZFIN) – the central database for the zebrafish community. A search for "glypican" in ZFIN returns a single entry – that for the zebrafish homologue to GPC3. To ascertain whether the incomplete coverage of the human GPC3 is due to limitations in our methodology, or to actual differences between the human and zebrafish homologues, we aligned the annotated zebrafish GPC3 homologue to the human protein (Figure 2 bottom). The alignment reveals the zebrafish GPC3 gene to be shorter than its human counterpart, consistent with our reconstruction. In fact, the Minimus contigs cover most of the zebrafish gene, with the exception of approximately 100 amino acids at the C terminus. This comparison also reveals a limitation of our approach. Short exons and/or splicing differences between the human and zebrafish homologues of the gene may prevent a simple translated search from identifying the shotgun reads necessary to reconstruct the full length gene. Despite such limitations, we believe our results show that Minimus can be successfully used as a first step in characterizing the homologues of a gene of interest in a newly sequenced organism. Furthermore, the approach we chose can be easily augmented to hierarchically recruit additional shotgun reads that extend the initial set of contigs, eventually reconstructing assemblies of entire genes. We implemented a simple version of such a procedure by also recruiting the mates for all reads identified during the translated searches. Unfortunately, the inclusion of these reads into the assembly process only resulted in marginal improvements. Better results will undoubtedly be obtained by extending this process to also incorporate reads that overlap the reconstructed contigs, however an implementation of such a procedure is beyond the scope of the current paper. BAC clone assembly The sequencing of complex genomes sometimes follows a hierarchical process, whereby the DNA is first sheared into segments of between 50-150 kbp which are then amplified in E. coli. These segments, called Bacterial Artificial Chromosomes (BACs), are then sequenced through the shotgun method. This hierarchical approach can overcome the complexity of highly repetitive genomes (e.g., Zea mays [20]), and has also been applied to the exploration of environmental samples (see, e.g., [21]). The shotgun sequencing of individual BAC clones generates approximately 2000–3000 sequencing reads, which can be easily assembled with Minimus. We extracted, at random, from the NCBI Trace Archive a collection of 10 shotgun libraries generated from mouse BAC clones, and assembled these data with both Minimus and phrap. All the selected BAC clones have been finished, providing us Page 6 of 11 (page number not for citation purposes) BMC Bioinformatics 2007, 8:64 http://www.biomedcentral.com/1471-2105/8/64 Top: Figure Blast 2 alignments of contigs resulting from minimus assembly of Zebrafish shotgun reads to the human GPC3 protein Top: Blast alignments of contigs resulting from minimus assembly of Zebrafish shotgun reads to the human GPC3 protein. The top four matches correspond to contigs that do not have any significant similarity to each other at the nucleotide level, indicating the presence of at least 4 homologues to the GPC3 gene in the Zebrafish genome. Bottom: Alignment of the Zebrafish GPC3 protein to the human GPC3 protein highlighting that the minimus assembly covers the majority of this gene. with a "gold standard" for evaluating the correctness of the assemblies. The results of this comparison are summarized in Table 1. On these datasets, Minimus ran faster than phrap (running time was approximately 60% of the running time of phrap) and produced contigs that mapped with few errors to the finished sequences; in contrast, the phrap contigs contained up to five times as many errors as Minimus (see Figure 3 for an example of erroneous alignments to the reference sequence) when compared to each of the finished BACs. These results are unsurprising as phrap's aggressive attempts to generate longer contigs (Minimus generated contigs that were about four times smaller than phrap) often result in misassemblies [22]. We argue that the conservative approach taken by Minimus is preferable in the case of BAC assembly, as mis-assemblies are often difficult to identify and correct, whereas the fragmented assemblies produced by Minimus can be easily improved by utilizing mate-pair information and by using alignment information between the individual contigs. Assembly of bacterial genomes In addition to the assembly of small datasets such as those described above, Minimus can also be used in conjunc- Page 7 of 11 (page number not for citation purposes) BMC Bioinformatics 2007, 8:64 http://www.biomedcentral.com/1471-2105/8/64 Table 1: Comparison of Minimus and phrap in the assembly of 10 mouse BACs from data obtained from the NCBI Trace Archive. BAC BAC size (bp) # Reads/seq. coverage RP23-179K16 195,061 RP23-188E5 157,996 RP23-111A22 200,329 RP23-271013 239,837 RP23-283E4 178,084 RP23-286D16 195,068 RP23-296N18 187,242 RP23-319P12 190,514 RP23-363E23 199,409 RP23-425H1 188,835 3685 8 2983 7 5428 10 7601 14 5708 15 4969 8 1536 6 5629 14 5301 12 1536 6 Running time # Contigs N50 contig size (kbp) Coverage (%) # errors 1 m 45 s 2 m 55 s 1m5s 2 m 33 s 56 s 1 m 43 s 3 m 11 s 6 m 30 s 3 m 49 s 9 m 53 s 41 s 1 m 22 s 36 s 1m5s 1 m 39 s 2 m 29 s 1 m 19 s 2 m 12 s 14 s 38 s 40 14 43 16 244 183 448 329 713 467 90 264 52 34 131 139 111 178 46 28 4.2 16.1 4.3 16.9 4.8 17 1.5 6.3 1.4 3.9 9 40 6.5 16.1 4.9 18 5.5 20 10 21 99.9 99.9 99.9 99.7 98.7 98.4 100.0 98.6 99.9 98.7 99.9 99.9 99.9 99.0 99.9 98.0 99.9 100 97.2 98.4 0 2 0 2 3 14 2 10 2 8 2 5 0 12 3 18 3 15 1 5 Minimus phrap Minimus phrap Minimus phrap Minimus phrap Minimus phrap Minimus phrap Minimus phrap Minimus phrap Minimus phrap Minimus phrap Minimus ran considerably faster than phrap and produced no errors, at the expense of a larger number of contigs. Note that the table contains two quantities denoted "coverage": the sequencing coverage (reported in the #Reads/seq. coverage column) represents the total amount of DNA in the sequenced reads, divided by the size of the chromosome, i.e. the redundancy in the sequenced data; the column headed "coverage" represents the fraction of the reference sequence covered by assembled contig. The latter measure does not take into account assembly errors, i.e. partial contig matches contribute to the overall coverage. tion with other assembly modules (such as the scaffolder Bambus [23]) to build an assembly pipeline for larger genomes. To assess the suitability of Minimus as a replacement for the phrap assembler in pipelines such as Atlas or Phusion, we compared the two assemblers on their ability to assemble two bacterial genomes: Brucella suis, and Staphylococcus aureus. Both genomes were sequenced and fully finished at TIGR, and all the sequencing reads generated for these projects are publicly available at both the NCBI Trace Archive, and from our website [24]. The availability of a finished molecule allowed us to compare the correctness of the assemblies generated by Minimus and phrap respectively, as shown in Figure 3. The results of our comparison are shown in Table 2. Similar to the case of BAC assembly, Minimus ran faster than phrap (approx. 2– 5 times faster) and produced no errors. The phrap assembly contained multiple errors (8 in B. suis and 5 in S. aureus), though it produced larger contigs, 4–5 times larger than those produced by Minimus. Again, the conservative approach taken by Minimus, as well as its efficiency, make it a better choice as the core component of a genome assembly pipeline such as Atlas or Phusion. In this context, mis-assemblies may present more challenges than the relatively smaller contigs generated by Minimus. Discussion One, perhaps surprising, result of our experiments is the higher fragmentation of the BAC assemblies in compari- son to the bacterial assemblies (observed both for Minimus and phrap), even though the BACs were sequenced to a deeper level of coverage. The reason for this fragmentation is the higher density of repeats in the mouse genome. Eukaryotic genomes often contain high-copy repeats that disrupt the assembly process, even within the range of a BAC insert. Such complex repeats are less frequently encountered in bacteria. Conclusion We have described Minimus, a shotgun sequence assembly program designed for the assembly of small data-sets, and shown that Minimus can be successfully used to extract individual genes from shotgun data-sets, thereby providing scientists with the means to analyze newly sequenced organisms long before complete genome assemblies are made available. Due to its small size and modular design Minimus is perfectly suited to be a component of complex assembly pipelines, as shown by its use at TIGR as the main workhorse in the influenza virus sequencing pipeline. Traditionally, phrap has been used as a main component of such pipelines. We compared Minimus to phrap on two median-sized assembly tasks, BAC clones and bacterial genomes, and found that Minimus is able to perform such assemblies more efficiently and more accurately than phrap, at the cost of producing smaller contigs. We would like to emphasize the fact that it important to obtain a correct assembly, even if this Page 8 of 11 (page number not for citation purposes) BMC Bioinformatics 2007, 8:64 http://www.biomedcentral.com/1471-2105/8/64 Figure Dot plots 3 of alignments of assemblies produced by minimus (top) and phrap (bottom) to the completed Brucella suis genome Dot plots of alignments of assemblies produced by minimus (top) and phrap (bottom) to the completed Brucella suis genome. The horizontal lines indicate the boundary between assembled contigs represented on the y axis. The vertical line separates between the two chromosomes of Brucella suis represented on the x axis. The minimus assembly (top) perfectly matches the reference sequence, as indicated by all matches lying along the main diagonal (except the contig at the bottom center, which spans the origin of the circular chromosome). The phrap assembly (bottom) shows many discrepancies with respect to the reference sequence (off-diagonal segments), including several contigs that incorrectly join segments of the two distinct chromosomes (e.g., second and third contigs from the bottom). Note that the ordering of the contigs implied by these figures is an artifact of the alignment to the reference sequence and does not correspond to the order in which the contigs were reported by the specific assembly tools. The discrepancies between the phrap assembly and the reference sequence prevent us from providing a consistent ordering for this assembly. Page 9 of 11 (page number not for citation purposes) BMC Bioinformatics 2007, 8:64 http://www.biomedcentral.com/1471-2105/8/64 Table 2: Comparison of Minimus and phrap in the assembly of two bacterial genomes (Brucella suis and Staphylococcus aureus). Genome Genome size (Mbp) # Reads/seq. coverage B. suis 3.3 S. aureus 2.9 36080 7.8 49014 9.2 Running time # Contigs N50 contig size (kbp) coverage (%) # errors 6 m 30 s 30 m 2 s 16 m 40 s 40 m 110 39 85 30 43 196 51 190 99.9% 99.7% 99.8% 99.7% 0 8 0 5 Minimus phrap Minimus phrap Minimus ran faster than phrap and produced no errors. However, it generated a considerably larger number of contigs. Note that the table contains two quantities denoted "coverage": the sequencing coverage (reported in the #Reads/seq. coverage column) represents the total amount of DNA in the sequenced reads, divided by the size of the chromosome, i.e. the redundancy in the sequenced data; the column headed "coverage" represents the fraction of the reference sequence covered by assembled contig. The latter measure does not take into account assembly errors, i.e. partial contig matches contribute to the overall coverage. assembly is fragmented. Assembly errors are often difficult to detect and correct, and are usually resolved through an expensive and time-consuming process of manual curation (no automated tools exists for this task), while fragmented assemblies can easily be improved in a highthroughput fashion by, for example, hierarchically combining the fragmented contigs based on lower-stringency overlap information. These results highlight the potential for Minimus to be used as a replacement for phrap in assembly pipelines such as Atlas or Phusion, especially as these pipelines already implement mechanisms for combining contigs. Also note that the errors in the phrap assemblies are an artifact of the greedy assembly algorithm used by phrap and cannot be resolved by simply adjusting the stringency of the assembly process. Operating systems: Unix (tested on Linux x86 and x86_64, Mac OSX, cygwin, Solaris, and Tru64) Finally, the modular design of Minimus (and its Open Source license) allows scientists to easily fine-tune, or replace, individual components of the assembly pipeline, tailoring the execution of Minimus to the specific characteristics of the data. Such fine-tuning is impossible in phrap, partly due to its restrictive license, and also due to its monolithic design. Minimus is therefore more than a simple assembler: it can be thought of as a potential testbed for evaluating specific assembly approaches, whether for educational purposes as part of a bioinformatics curriculum, or during the conduct of research in genome assembly. Authors' contributions Programming languages: C++, Perl Other requirements: none for Minimus, some components of AMOS require the QT library License: OSI Artistic License Any restrictions to use by non-academics: none Test data for running Minimus can be downloaded from the Minimus website: http://amos.sourceforge.net/docs/ pipeline/minimus.html. DDS implemented the unitigger and the overall execution pipeline and ran the assemblies presented in the results section. ALD implemented the overlapper and multialigner programs. SLS provided the initial impetus for the design of Minimus and encouraged and oversaw the integration with the flu sequencing pipeline. MP led the design of the package, provided conversion utilities for various assembly formats, and performed the analysis of the zebrafish GPC3 homologues. All authors contributed to writing the manuscript. Acknowledgements Availability and requirements Minimus is distributed under an Open Source license (the Artistic License) as a component of the AMOS package [25]. The details for this package are provided below. Project name: AMOS Project homepage: http://amos.sourceforge.net We thank Martin Shumway and Jeff Sitz from The Institute for Genomic Research for providing us with detailed information on the use of Minimus as part of the Influenza A sequencing pipeline. We also thank Mike Schatz for providing us with a vector- and quality-trimmed set of zebrafish reads, and the anonymous reviewers for their detailed and insightful comments. Finally, we thank Marina Lee for suggesting GPC3 as a test of Minimus' ability to reconstruct the assembly of individual genes. The development of Minimus was supported in part by NIH under grants R01-LM06845 and R01-LM007938 to SLS and by DHS cooperative agreement W81XWH-052-0051. Page 10 of 11 (page number not for citation purposes) BMC Bioinformatics 2007, 8:64 References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Sutton GG, White O, Adams MD, Kerlavage AR: TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects. Genome Science and Technology 1995, 1:9-19. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998, 8(3):186-194. Huang X, Madan A: CAP3: A DNA Sequence Assembly Program. Genome Research 1999, 9:868-877. Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KH, Remington KA, Anson EL, Bolanos RA, Chou HH, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC: A wholegenome assembly of Drosophila. Science 2000, 287(5461):2196-2204. Batzoglou S, Berger B, Mesirov J, Lander ES: Sequencing a Genome by Walking with Clone-End Sequences: A Mathematical Analysis. Genome Research 1999, 9:1163-1174. Jaffe DB, Butler J, Gnerre S, Mauceli E, Lindblad-Toh K, Mesirov JP, Zody MC, Lander ES: Whole-genome sequence assembly for Mammalian genomes: arachne 2. Genome Res 2003, 13(1):91-96. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigo R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O'Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, Lander ES: Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420(6915):520-562. Mullikin JC, Ning Z: The phusion assembler. Genome Res 2003, 13(1):81-90. Havlak P, Chen R, Durbin KJ, Egan A, Ren Y, Song XZ, Weinstock GM, Gibbs RA: The Atlas genome assembly system. Genome Res 2004, 14(4):721-732. Aparicio S, Chapman J, Stupka E, Putnam N, Chia JM, Dehal P, Christoffels A, Rash S, Hoon S, Smit A, Gelpke MD, Roach J, Oh T, Ho IY, Wong M, Detter C, Verhoef F, Predki P, Tay A, Lucas S, Richardson P, Smith SF, Clark MS, Edwards YJ, Doggett N, Zharkikh A, Tavtigian SV, Pruss D, Barnstead M, Evans C, Baden H, Powell J, Glusman G, Rowen L, Hood L, Tan YH, Elgar G, Hawkins T, Venkatesh B, Rokhsar D, Brenner S: Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 2002, 297(5585):1301-1310. http://www.biomedcentral.com/1471-2105/8/64 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995, 269(5223):496-512. NCBI Trace Archive [http://www.ncbi.nlm.nih.gov/Traces] Ghedin E, Sengamalay NA, Shumway M, Zaborsky J, Feldblyum T, Subbu V, Spiro DJ, Sitz J, Koo H, Bolotov P, Dernovoy D, Tatusova T, Bao Y, St George K, Taylor J, Lipman DJ, Fraser CM, Taubenberger JK, Salzberg SL: Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution. Nature 2005, 437(7062):1162-1166. Salzberg SL, Church D, DiCuccio M, Yaschenko E, Ostell J: The genome Assembly Archive: a new public resource. PLoS Biol 2004, 2(9):E285. Carvalho AB, Clark AG: Y chromosome of D. pseudoobscura is not homologous to the ancestral Drosophila Y. Science 2005, 307(5706):108-110. Blackhall FH, Merry CL, Davies EJ, Jayson GC: Heparan sulfate proteoglycans and cancer. Br J Cancer 2001, 85(8):1094-1098. Pilia G, Hughes-Benzie RM, MacKenzie A, Baybayan P, Chen EY, Huber R, Neri G, Cao A, Forabosco A, Schlessinger D: Mutations in GPC3, a glypican gene, cause the Simpson-Golabi-Behmel overgrowth syndrome. Nat Genet 1996, 12(3):241-247. Toretsky JA, Zitomersky NL, Eskenazi AE, Voigt RW, Strauch ED, Sun CC, Huber R, Meltzer SJ, Schlessinger D: Glypican-3 expression in Wilms tumor and hepatoblastoma. J Pediatr Hematol Oncol 2001, 23(8):496-499. Xiang YY, Ladeda V, Filmus J: Glypican-3 expression is silenced in human breast cancer. Oncogene 2001, 20(50):7408-7412. MaizeGDB - Maize Genetics and Genomics Database [http:/ /www.maizegdb.org] Beja O, Aravind L, Koonin EV, Suzuki MT, Hadd A, Nguyen LP, Jovanovich SB, Gates CM, Feldman RA, Spudich JL, Spudich EN, DeLong EF: Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 2000, 289(5486):1902-1906. Pop M, Salzberg SL, Shumway M: Genome sequence assembly: Algorithms and issues. Computer 2002, 35(7):47-+. Pop M, Kosack DS, Salzberg SL: Hierarchical scaffolding with Bambus. Genome Res 2004, 14(1):149-159. Benchmark Data for Genome Assembly [http:// www.cbcb.umd.edu/research/benchmark.shtml] AMOS - A Modular Open-Source Assembler [http:// amos.sourceforge.net] Peltola H, Soderlund H, Ukkonen E: SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Res 1984, 12(1):307-321. Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA: Reducing storage requirements for biological sequence comparison. Bioinformatics 2004, 20(18):3363-3369. Myers EW: Toward Simplifying and Accurately Formulating Fragment Assembly. J Comp Bio 1995, 2(2):275-290. Myers EW: The fragment assembly string graph. Bioinformatics 2005, 21 Suppl 2:ii79-ii85. Running Celera Assembler: Trimming the input data [http/ www.cbcb.umd.edu/research/CeleraAssembler.shtml#trimmingth einput] Chou HH, Holmes MH: DNA sequence quality trimming and vector removal. Bioinformatics 2001, 17(12):1093-1104. Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 11 of 11 (page number not for citation purposes)