Software: Next-Generation Sequence Alignment Software



Software
Next-Generation Sequence Alignment Software
Computational Gene Finding
Genome Assembly and Large-Scale Genome alignment
Sequence Analysis Tools
Variant Analysis Tools
Webservers and Databases

Next-generation sequence alignment software

Bowtie An ultrafast, memory-efficient short read aligner that aligns short DNA sequences to the human genome at a rate of
about 25 million reads per hour on a typical desktop computer. Bowtie indexes the genome with a Burrows-Wheeler
index to keep its memory footprint small: 2.3 GB for the human genome. Bowtie and Bowtie2 were developed by
Ben Langmead and are actively supported by his lab.
TopHat A spliced alignment system for RNA-seq experiments. TopHat finds known and novel exon-exon splice junctions
and is extremely fast due to its use of the Bowtie2 aligner. The latest release, TopHat2, runs with either Bowtie1 or
Bowtie2 and includes new algorithms that significantly enhance TopHat's sensitivity, particularly in the presence of
pseudogenes. TopHat2 includes TopHat-Fusion as an option.
TopHat-Fusion TopHat-Fusion is an enhanced version of TopHat with the ability to align reads across fusion points, which result
from the breakage and re-joining of two different chromosomes, or from rearrangements within a chromosome.
HISAT HISAT is a new, highly efficient system for aligning RNA-seq reads. HISAT uses a new indexing scheme,
hierarchical indexing, which is inherently well-suited for aligning across introns. It employs two types of indexes for
alignment: (1) a whole-genome FM index to anchor each alignment, and (2) numerous local FM indexes for very
rapid extensions of these alignments. HISAT supports genomes of any size, including those larger than 4 billion
bases.
HISAT2 HISAT2 is a new, rapid and accurate system for aligning NGS reads (both DNA and RNA) against a population of
genomes. HISAT2 is a successor to both HISAT and TopHat2. In this program, we extended the Burrows-Wheeler
transform (BWT) and the Ferragina-Manzini (FM) index to incorporate genomic differences among individuals into
the reference genome.
Cufflinks A transcript assembler and abundance estimator for RNA-seq data. Cufflinks assembles transcripts from the
alignments produced by TopHat, including novel isoforms, and quantitates those transcripts. Cufflinks was originally
developed by Cole Trapnell and is supported by his lab at the University of Washington.
StringTie A new and fast transcript assembler and abundance estimator for RNA-seq data. Similar to Cufflinks, StringTie
assembles transcripts from the alignments produced by TopHat, including novel isoforms, and quantitates those
transcripts.
Ballgown A program for computing differentially expressed genes in two or more RNA-seq experiments, using the output of
StringTie or Cufflinks. The Ballgown package provides functions to organize, visualize, and analyze expression
measurements. Ballgown is written in R and is part of Bioconductor.
CloudBurst A program for highly sensitive short read mapping using MapReduce. CloudBurst, developed by Michael Schatz
(now a faculty member at JHU Computer Science), uses Hadoop to efficiently parallelize the short read mapping
problem to dozens or hundreds of computers. This enables CloudBurst to execute highly sensitive read mappings
with any number of mutations or indels.
Crossbow Crossbow is a scalable software pipeline for whole genome resequencing analysis. It combines Bowtie, an ultrafast
and memory-efficient short read aligner, and SoapSNP, an accurate genotyper, within Hadoop to distribute and
accelerate the computation across many nodes. In the Crossbow paper, we used it to analyze 35x coverage of a
human genome in 3 hours for about $100 using a 40-node, 320-core cluster rented from Amazon's EC2 utility
computing service.
Diamund Diamund is a new, efficient algorithm for variant detection that compares DNA sequences directly to one another,
without aligning them to the reference genome.
EDGE-pro EDGE-pro is a program for estimating gene expression from prokaryotic RNA-seq. EDGE-pro uses Bowtie2 for
alignment but, unlike TopHat and Cufflinks, does not allow spliced alignments. It also handles overlapping genes, a
common phenomenon in bacteria that is largely absent in eukaryotes.
Kraken Kraken is a very fast system for taxonomic classification of short or long DNA sequences from a microbiome or
metagenomic sample. See the 2014 Genome Biology paper here.
Centrifuge Centrifuge is a very rapid and memory-efficient system for the classification of DNA sequences from microbial
samples, with better sensitivity than and comparable accuracy to other leading systems. Centrifuge requires a
relatively small index (e.g., 4.3 GB for ~4,100 bacterial genomes).
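
The Bowtie entry above says the genome is indexed with a Burrows-Wheeler index to keep memory use low. The sketch below illustrates the general idea on a toy string: build the Burrows-Wheeler transform, then count and locate exact matches with FM-index backward search. It is not Bowtie's code or API. All function names are hypothetical, and the naive suffix sort and full occurrence table used here are far less memory-efficient than a production index.

```python
from collections import Counter


def bwt_via_suffix_array(text):
    """Return (suffix_array, bwt) for text ending in a unique '$' sentinel."""
    # Naive suffix sort: fine for a toy example, too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character is the one preceding each sorted suffix
    # (text[-1], the sentinel, for the suffix starting at position 0).
    bwt = "".join(text[i - 1] for i in sa)
    return sa, bwt


def build_fm_index(bwt):
    """Return (first, occ) tables used by backward search."""
    counts = Counter(bwt)
    first = {}  # first[c] = number of characters in the text smaller than c
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def find_occurrences(pattern, sa, first, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # current half-open range of suffix-array rows
    for ch in reversed(pattern):  # extend the match one character leftwards
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    reference = "ACGTACGA$"  # toy reference with sentinel
    sa, bwt = bwt_via_suffix_array(reference)
    first, occ = build_fm_index(bwt)
    print(bwt)
    print(find_occurrences("ACG", sa, first, occ))
```

Each step of the search touches only the two small tables, never the original text, which is why this family of indexes can stay compact. Real aligners additionally sample the occurrence table and suffix array rather than storing them in full, and they tolerate mismatches, which this sketch does not.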

Computational Gene Finding

Glimmer A system that uses interpolated Markov models to find genes in microbial DNA. Used to annotate hundreds
(possibly thousands) of bacterial, archaeal, and viral genomes. Current version is 3.02.
GlimmerHMM A Generalized Hidden Markov Model gene-finder which makes use of the techniques implemented previously by
GlimmerM.

GeneSplicer A fast system for detecting splice sites in genomic DNA of various eukaryotes.
SIM4CC An accurate and efficient program to align cDNA sequences (mRNAs, ESTs) to genomic sequences, specifically
designed for cross-species alignment.
sim4db / leaff Fast high-throughput spliced alignment (sim4, sim4cc) and sequence indexing.
ASprofile A suite of programs for extracting, quantifying and comparing alternative splicing (AS) events from RNA-seq data.
JIGSAW A program that predicts gene models using the output from multiple sources of evidence, including other gene
finders, Blast searches, and other alignment data.

Genome assembly and large-scale genome alignment

MUMmer A system for aligning whole genomes, chromosomes, and other very long DNA sequences. MUMmer is also widely
used for comparing genome assemblies. NOTE: MUMmer has been at SourceForge since the early 2000s, but in
2016 we are moving it to GitHub, and a new version, MUMmer4, will appear soon.
MUMmerGPU High throughput sequence alignment using Graphics Processing Units (GPUs). Uses a technique called general-
purpose GPU programming (GPGPU programming) to harness the extreme parallelism of GPUs for non-graphics
tasks. In this application, hundreds of query sequences are simultaneously aligned to a reference sequence,
creating an order of magnitude speed up over the same alignment on the CPU.
GAGE A realistic assessment of genome assembly software in a rapidly changing field of next-generation sequencing.
GAGE-B An evaluation of contiguity and accuracy of assemblies of bacterial organisms that are generated by some of the most
commonly used genome assemblers. GAGE-B follows the standards set by GAGE.
MaSuRCA MaSuRCA is a whole-genome assembler developed originally at the University of Maryland by James Yorke,
Aleksey Zimin, and their colleagues. Ongoing development is a joint effort between UMD and JHU, and new
modules coming soon include methods to create hybrid assemblies using both Illumina and PacBio data.
AMOS Assembler project This is a set of tools, libraries, and freestanding genome assemblers, all open source. AMOS is an open consortium
started at The Institute for Genomic Research (TIGR) that grew to include the University of Maryland, Johns
Hopkins University, The Karolinska Institutet, the Marine Biological Laboratory, and others.
AMOScmp is a comparative genome assembler, which uses one genome as a reference on which to assemble another, closely
related species. See the journal paper here.
MINIMUS A small, lightweight assembler for small jobs such as assembling a viral genome, assembling a set of reads that
match a single gene, or other tasks that don't require the complex infrastructure of a large-genome assembler.
Hawkeye A visual analytics tool for genome assembly analysis and validation, designed to aid in identifying and correcting
assembly errors. All levels of the assembly data hierarchy are made accessible to users, along with summary
statistics and common assembly metrics. A ranking component guides investigation towards likely mis-assemblies
or interesting features to support the task at hand. Can be used to interactively analyze assemblies from many
popular assemblers on your desktop computer. See the journal paper here.
Quake A software package to detect and correct substitution sequencing errors in WGS data sets with deep coverage.
FLASH A fast, accurate program to increase the length of reads by overlapping and merging paired reads from fragments
shorter than twice the length of reads. Primarily designed to merge Illumina paired reads.
Celera Assembler A whole genome assembler originally developed at Celera Genomics for the assembly of the human genome.
CeleraAssembler is an open-source project at SourceForge. The code has been actively maintained since 2005 by
researchers at CBCB and the Venter Institute (formerly known as TIGR, The Institute for Genomic Research).
ABBA Assembly Boosted By Amino acid sequence is a comparative gene assembler, which uses amino acid sequences
from predicted proteins to help build a better assembly. See the journal paper. Link for installation and more
information.
AutoEditor A tool for correcting sequencing and basecaller errors using sequence assembly and chromatogram data from
Sanger (1st generation) reads. On average, AutoEditor corrects 80% of erroneous base calls, with an accuracy of
99.99%.
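
The FLASH entry above describes merging paired reads whose fragment is shorter than twice the read length, so the two reads overlap. The sketch below shows that general idea, not FLASH's actual algorithm or scoring. It assumes the second read is given as sequenced (on the opposite strand), tries every overlap length from longest to shortest, and keeps the overlap with the lowest mismatch ratio. The function names, `min_overlap`, and `max_mismatch_ratio` are hypothetical illustration values.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.1):
    """Merge two overlapping paired reads into one longer read.

    read2 is expected as sequenced, i.e. on the opposite strand.
    Returns the merged sequence, or None if no acceptable overlap exists.
    """
    mate = reverse_complement(read2)  # bring read2 onto read1's strand
    best = None  # (mismatch_ratio, overlap_length)
    longest = min(len(read1), len(mate))
    for olap in range(longest, min_overlap - 1, -1):
        tail = read1[-olap:]  # end of read1
        head = mate[:olap]    # start of the re-oriented mate
        mismatches = sum(a != b for a, b in zip(tail, head))
        ratio = mismatches / olap
        # Strict '<' keeps the longer overlap when ratios tie.
        if ratio <= max_mismatch_ratio and (best is None or ratio < best[0]):
            best = (ratio, olap)
    if best is None:
        return None
    # Keep read1 whole and append the part of the mate beyond the overlap.
    return read1 + mate[best[1]:]


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTACCGATGCATGCCTAG"  # 30 bp toy fragment
    r1 = fragment[:20]                           # 20 bp forward read
    r2 = reverse_complement(fragment[-20:])      # 20 bp reverse read
    print(merge_pair(r1, r2) == fragment)
```

In this toy case the fragment is 30 bases and each read is 20, so the reads share a 10-base overlap and the merged read reconstructs the whole fragment. A production tool would also use base quality scores to resolve mismatches inside the overlap, which this sketch ignores by simply keeping read1's bases.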

Other sequence analysis tools


BRCA gene testing A computational screening test that takes the raw DNA sequence data from a whole-genome sequence of an
individual human and tests for each of 68 known mutations in the BRCA1 and BRCA2 genes.

DivE Software to find regions that evolve at a slower or faster rate than the neutral evolution rate in any clade of a
phylogeny of a set of very closely related species.
DupLoCut Software that computes ancestral gene orders under the duplication-loss evolutionary model.
ELPH A motif finder based on Gibbs sampling that can find ribosome binding sites, exon splicing enhancers, or regulatory
sites.
fqtrim a software utility for filtering and trimming high-throughput next-gen reads.
GFF utilities gffread: a program for filtering, converting and manipulating GFF files
gffcompare: a program for comparing, annotating, merging and tracking transcripts in GFF files
Insignia A comprehensive system for finding unique DNA sequences that can be used to identify any bacterial or viral
species or strain. Currently has over 13,000 species and strains in its database.
Kraken A fast system for taxonomic classification of short or long metagenomic DNA sequences.
Centrifuge A very rapid and memory-efficient system for the classification of DNA sequences from microbial samples.
PhymmBL A one-stop system for taxonomically classifying metagenomic short reads.
OperonDB Software and a database of operons covering a large number of prokaryotic genomes. Described in M. Pertea et
al., Nucl. Acids Res 37 (2009), D479-D482.
rddChecker A program for determining sites of RNA-DNA differences (RDDs) and candidate RNA editing sites from RNA-seq
data.
RepeatFinder an older system for finding and characterizing repetitive sequences in complete and partial genomes.
Scimm A tool for unsupervised clustering of metagenomic sequences using interpolated Markov models.
SEE ESE an online tool for identifying exon splicing enhancers (ESEs) in Arabidopsis and Drosophila.
TransTermHP A highly accurate program that finds rho-independent transcription terminators in bacterial genomes. The site
includes a database with pre-computed predictions for hundreds of species.

Variant Analysis Tools

CHASM and SNVBox Software to predict the functional significance of somatic missense mutations observed in the genomes of cancer
cells, and a database of pre-computed features of all possible amino acid substitutions at every position of the
annotated human exome.
CRAVAT Cancer-related analysis of variants toolkit. Web tool for functional predictions and annotations of both somatic and
germline variants.
FAST An application for genome-wide studies that efficiently runs several gene-based analysis methods simultaneously
on the same data set.
LS-SNP/PDB Web tool for structural annotations and visualizations of missense variants in dbSNP.
muPIT Web tool for interactive structural annotations and visualizations of non-synonymous variation/mutation on proteins.

Other web servers and databases

ARDB Antibiotic Resistance Genes Database (new in early 2009).


EnteriX Web servers for displaying alignments and annotations of bacterial genomes.
A collection of links (now very old) to external sequence analysis programs.

The Center for Computational Biology at Johns Hopkins University

