Software
Currently, this page only includes software I am familiar with. Most
of these programs are designed to align next-generation sequencing (NGS)
data and have been developed since 2007. I may extend the list when I
have time. Several notes:
- The programs are listed in alphabetical order within each category.
- Features shown in brackets are optional and may affect efficiency.
- The version number shown for each program is the one I have
checked, but may not be the latest.
Indexing Reads with Hash Tables
- CloudBurst
[PMID:19357099]. RMAP-like
algorithm that works in a cloud.
- Platform: Illumina
- Features: supports cloud computing
- Availability: open source
- Cross_match [1.080730]. The
latest cross_match has been substantially improved for short read
alignment. Its speed is comparable to other aligners and might be
the best choice for local alignment.
- Platform: Illumina; 454
- Features: gapped alignment (maximum 2 gaps in the fast mode);
local alignment
- Availability: source code free for academic use
- Eland [1.0]. Probably the first short read aligner. Eland
has substantially influenced many aligners in this category and still
outperforms many of its followers. Although it is no longer the fastest,
it is close to the fastest and has the smallest memory
footprint. Eland itself works for 32bp single-end reads
only. Additional Perl scripts in GAPipeline extend its abilities.
- Platform: Illumina
- Features: PET mapping; mapping quality; SNP caller; counting
suboptimal occurrences.
- Advantages: fast; lightweight
- Availability: source code free for machine buyers.
- MAQ [0.7.1,
PMID: 18714091]. This
is my program to align short reads and to call variants. It has been
used in several high-profile papers.
- Platform: Illumina; SOLiD (partial)
- Features: PET mapping; quality aware; gapped alignment for
PET; mapping quality; adapter trimming; partial occurrences
counting; SNP caller
- Advantages: feature rich; publication-proven
- Limitation: up to 128bp reads; no gapped alignment for single-end reads
- Availability: GPL
- mrFast/mrsFAST
[0.5.1]. An aligner specifically designed for reporting all hits.
- Platform: Illumina
- Features: all hits; up to 3X faster than MAQ (not tested by
myself); gapped alignment (mrFAST)
- Availability: Free binary
- RazerS [20081029, PMID:
19592482]. q-gram filtration; based on
the SeqAn library.
- Availability: free source code
- RMAP [0.41,
PMID: 18307793;19736251]. One
of the earliest short read aligners.
- Platform: Illumina
- Features: quality aware; [gapped alignment]; best unique hits
- Availability: GPL
- SeqMap
[1.0.8,
PMID: 18697769].
An Eland-like program.
- Platform: Illumina
- Features: [gapped alignment]
- Limitation: not counting suboptimal hits
- Availability: GPL
- SHRiMP
[1.10, PMID: 19461883]. Q-gram based algorithm.
- Platform: SOLiD; Illumina; 454
- Features: SOLiD mapping; gapped alignment; potential support
for mapping quality
- Limitations: a little slow
- Availability: GPL
- ZOOM
[1.2.5,
PMID: 18684737]. Eland-like
algorithm improved by the use of spaced seeds. ZOOM supports
longer reads and is faster than Eland, although it uses more
memory. ZOOM is feature rich, but some features may come at the cost
of speed.
- Platform: Illumina; SOLiD
- Features: PET mapping; SOLiD mapping; [gapped alignment];
[mapping quality]; [quality aware]
- Advantage: fast; feature rich
- Limitation: up to 224bp reads; gapped alignment comes with cost
- Availability: commercial
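As a rough illustration of the approach shared by the programs in this
category, the sketch below indexes reads in a hash table keyed by a
spaced seed and then scans the reference for candidate hits. The seed
mask, the function names and the toy sequences are assumptions made for
illustration; none of them come from Eland, ZOOM or any other program
listed above.

```python
from collections import defaultdict

# Hypothetical spaced seed: '1' = position must match, '0' = position ignored.
MASK = "1101011"


def seed_key(seq, offset, mask=MASK):
    """Return the characters of seq that fall under the '1' positions of the mask."""
    return "".join(seq[offset + i] for i, m in enumerate(mask) if m == "1")


def index_reads(reads, mask=MASK):
    """Hash every read by the spaced seed taken at its first base."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        if len(read) >= len(mask):
            table[seed_key(read, 0, mask)].append(read_id)
    return table


def scan_reference(ref, reads, table, max_mismatches=1, mask=MASK):
    """Slide along the reference, look up each seed, then verify the candidates."""
    hits = []
    for pos in range(len(ref) - len(mask) + 1):
        for read_id in table.get(seed_key(ref, pos, mask), ()):
            read = reads[read_id]
            window = ref[pos:pos + len(read)]
            if len(window) < len(read):
                continue  # read would run off the end of the reference
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((read_id, pos, mismatches))
    return hits


reads = ["ACGTACGTAC", "ACTTACGTAC", "GGGGGGGGGG"]
ref = "TTACGTACGTACTT"
table = index_reads(reads)
hits = scan_reference(ref, reads, table)
print(hits)
```

The second read differs from the reference at a position the mask
ignores, so it is still found through the same hash bucket as the first
read. A single contiguous seed of the same span would have missed it.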
Indexing Genome with Hash Tables
- BFAST [0.3.1, PMID:
19907642].
- Platform: Illumina; SOLiD
- Availability: open source
- Comment on paper: the evaluation of Bowtie and BWA may be questionable.
- gnumap [PMID: 19861355].
- Platform: Illumina
- Features: "Assigning a proportion of the read to relevant
genomic matches based on the relative likelihood that the read
maps to each location". (I do not know how this compares with
randomly distributing repetitive reads.)
- Comment on paper: It is possible to achieve paired-end mapping
by indexing reads only. MAQ does it this way.
- Availability: open source (?)
- MOM [0.1,
PMID: 19228804].
- Platform: Illumina; (?)
- Features: counting suboptimal occurrences; local alignment
- Availability: free
- Mosaik
[1.0]. Mosaik has been used in several high-profile publications and
delivers good performance.
- Platform: Illumina; 454; SOLiD
- Advantages: long reads
- Availability: open source
- NovoAlign [2.0]. NovoAlign
competes with MAQ on speed and feature set, and may be more accurate
than MAQ. It also implements several important features missing in
MAQ.
- Platform: Illumina
- Features: PET mapping; gapped alignment; mapping quality;
quality aware; adapter trimming; MAQ format
- Advantages: highly accurate; gapped alignment; feature rich
- Requirements: >8GB RAM for paired-end mapping against the
human genome.
- Availability: proprietary; academic free binary (no
multi-threading support)
- PASS [0.5,
PMID: 19218350].
- Platform: Illumina; SOLiD; 454
- Features: PET mapping
- Advantages: long reads
- Requirement: >15GB RAM against human genome
- Availability: source code free to academic users
- PerM [0.1.0, PMID: 19675096]
- Platform: Illumina; SOLiD
- Advantages: fast
- Availability: GPL
- Limitation: no paired-end mapping apparently.
- Requirement: 4.5 bytes per reference base
- Comment on paper: PerM is very fast. The authors attribute its speed to
the use of spaced seeds with higher weight. This is one reason,
but to me, not the leading reason. I think PerM is fast mainly
because, when building the index, it aligns the genome against itself
for a specified read length; in alignment, PerM then aligns a
repetitive read once rather than to each copy. The cost is that a
user needs to build a huge index for each read length.
- SOAPv1 [1.11,
PMID: 18227114]. The first published short
read aligner.
- Platform: Illumina
- Features: PET mapping; adapter trimming; gapped alignment; SNP
caller; counting occurrences
- Advantages: feature rich
- Requirements: >14GB RAM against human genome
- Availability: GPL
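The gnumap entry above quotes a scheme that splits a read across
candidate locations in proportion to their relative likelihood. A
minimal sketch of that idea, next to the alternative of placing a
repetitive read at one randomly chosen location, might look as
follows. The likelihood model (a fixed per-base error rate) and all
function names are assumptions for illustration and are not taken from
gnumap.

```python
import random


def likelihood(mismatches, read_len, error_rate=0.01):
    """Toy likelihood of a read at one location, assuming independent per-base errors."""
    return (error_rate ** mismatches) * ((1 - error_rate) ** (read_len - mismatches))


def proportional_weights(mismatch_counts, read_len, error_rate=0.01):
    """Fraction of the read assigned to each candidate location."""
    scores = [likelihood(m, read_len, error_rate) for m in mismatch_counts]
    total = sum(scores)
    return [s / total for s in scores]


def random_assignment(mismatch_counts, rng=random):
    """Alternative: place the whole read at one randomly chosen best location."""
    best = min(mismatch_counts)
    candidates = [i for i, m in enumerate(mismatch_counts) if m == best]
    return rng.choice(candidates)


# One 36bp read with three candidate locations: two perfect, one with a mismatch.
weights = proportional_weights([0, 0, 1], read_len=36)
choice = random_assignment([0, 0, 1], random.Random(0))
print(weights, choice)
```

Under this toy model the two perfect locations share almost all of the
read equally, while random assignment gives the whole read to one of
them. Summed over many reads the two schemes give similar expected
coverage, but the proportional one has lower variance per location.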
Merge Sorting
- Slider
[0.6,
PMID: 18974170]. A
very clever short read aligner specifically designed for Illumina
reads. It is able to use the second best base call, which
potentially improves the accuracy on SNP finding.
- Platform: Illumina
- Features: Using second base
- Advantages: fast; potentially more accurate on SNP discovery
- Requirements: >160GB disk space
- Availability: free source code
- Slider II [1.1].
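The category name refers to sorting both the reads and the substrings
of the reference, then walking the two sorted lists together. The
sketch below shows that merge step for exact matches only. It is a toy
under stated assumptions (equal-length reads, no mismatches, everything
in memory), not Slider's actual algorithm, and it ignores second-best
base calls entirely.

```python
def merge_align(reads, ref):
    """Exact-match alignment by merging two lexicographically sorted lists."""
    if not reads:
        return []
    k = len(reads[0])  # assumption: all reads share one length
    # Sort the reference k-mers and the reads.
    kmers = sorted((ref[i:i + k], i) for i in range(len(ref) - k + 1))
    sorted_reads = sorted((read, idx) for idx, read in enumerate(reads))

    hits = []
    j = 0  # pointer into the sorted k-mer list; it only ever moves forward
    for read, idx in sorted_reads:
        # Skip reference k-mers that sort before the current read.
        while j < len(kmers) and kmers[j][0] < read:
            j += 1
        # Report every equal k-mer without moving j, so duplicate reads still match.
        t = j
        while t < len(kmers) and kmers[t][0] == read:
            hits.append((idx, kmers[t][1]))
            t += 1
    return sorted(hits)


hits = merge_align(["ACG", "CGT", "TTT", "ACG"], "ACGTACGA")
print(hits)  # (read index, reference position) pairs
```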
Indexing Genome with Suffix Array/BWT
- Bowtie
[0.9.9, PMID: 19261174]. This is probably the fastest short read aligner to
date. Although under the default options Bowtie is not guaranteed to
find the best hit or to tell whether the hit it finds is unique, it is
possible to improve this behaviour at the cost of speed.
- Platform: Illumina
- Features: partial PET mapping; quality aware; [mapping quality]
- Advantages: very fast
- Availability: GPL
- BWA [0.5.1,
PMID: 19451168]. Another aligner written by me. Given high-quality
reads, it is an order of magnitude faster than MAQ while achieving
similar alignment accuracy.
- Platform: Illumina; SOLiD; 454; Sanger
- Features: PET mapping (short reads only); gapped alignment;
mapping quality; counting suboptimal occurrences (short reads
only); SAM output
- Advantages: fast
- Limitations: the short read algorithm is slow for long reads and
for reads with a high error rate
- Availability: GPL
- SOAPv2 [2.19, PMID:
19497933]. A marvelous program developed by the group who
wrote BWT-SW.
- Platform: Illumina
- Features: PET mapping; mapping quality; counting occurrences
- Advantages: fast
- Availability: academic free binary
- segemehl
[0.0.7, PMID: 19750212].
- Platform: Illumina
- Features: accurate
- Limitation: large memory requirement; no paired-end mapping
- Comment on paper: the authors show MAQ and BWA to be less
accurate, probably because they counted ambiguous alignments.
- vmatch
[SpringerLink].
- Availability: academic free binary
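For readers unfamiliar with this category, the sketch below shows the
standard backward search on a BWT-based index, restricted to exact
matching. The index is built naively by sorting suffixes, which is fine
for a toy string but not for a genome. The function names are
hypothetical, and the code does not reproduce how Bowtie, BWA or SOAPv2
handle mismatches, gaps or memory-efficient indexing.

```python
def build_fm_index(text):
    """Build a naive suffix array plus the C and occ tables for text + '$'."""
    text += "$"  # sentinel, assumed absent from text and smaller than any base
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to the sentinel
    alphabet = sorted(set(text))
    # C[c]: number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return the sorted text positions at which pattern occurs exactly."""
    lo, hi = 0, len(sa)  # half-open interval in the suffix array
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, C, occ = build_fm_index("GATTACAGATTA")
positions = backward_search("ATTA", sa, C, occ)
print(positions)
```

Each pattern character costs two table lookups regardless of genome
size, which is one reason aligners in this category can be fast once
the index is built.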
Recommendation
First of all, as I am the key developer of two short read aligners
(BWA and MAQ), it is really hard for me to give an unbiased
evaluation. Please bear this fact in mind when reading through my
comments below.
For Illumina reads, I would recommend my program BWA. BWA implements
most of the major features of a practical aligner. It has a relatively
small memory footprint and is highly efficient with little tradeoff in
accuracy. BWA outputs alignments in the SAM format. Users may
use SAMtools to
sort/merge alignments and to make variant calls. One potential
concern about BWA is that it has not been widely used at the moment. It may
be less robust than publication-proven aligners such as Eland
and MAQ.
[Update: With the help of paired-end reads, MAQ is able to
find some SNPs at the edge of highly repetitive regions. However, BWA
cannot. Nonetheless, I still prefer BWA given its speed and the fact
that SNPs that can be called from repeats are rare and more likely to
be false positives.]
Mapping inconsistent read pairs with NovoAlign is recommended for
PET-based structural variation detection, where alignment accuracy is
the leading factor in reducing false positive calls. NovoAlign is the
most accurate aligner to date.