PHASE is the gold standard for phasing genotype data and has proved to be the most accurate program for this task. However, PHASE is rather slow when applied to hundreds of markers in hundreds of individuals, which has led to the development of several newer phasing programs, including BACH, BEAGLE, fastPHASE, HaploRec, HIT and MACH. How do they perform? This page, which is largely inspired by Rastas et al. (2008), will give you a rough idea.


Coalescent simulation

Coalescent simulation was done with MaCS, using a command line adapted from Plagnol and Wall (2006). 100 European and 100 African haplotypes of length 100 kbp were generated, making up 100 diploid individuals in total. All segregating sites were retained, and no missing genotypes were added. Individuals from the two populations were mixed together and fed to the phasing programs as input. The MaCS command line is:

It is important to note that the performance of a phasing program may vary considerably with the type of input data. I do not intend to provide a comprehensive benchmark under various conditions, only a simple view of how these programs differ from each other in one experiment, which could be more realistic if:


Fast Phasing Programs


Measurement

The most widely used measure of phasing accuracy is the switch error rate (Stephens and Donnelly, 2003; Marchini et al., 2006), which equals the switch distance between the true phased sequences and the inferred phased sequences divided by the maximum possible switch distance. Given two pairs of phased sequences, the switch distance between them is the minimum number of switches required to turn one pair into the other. The maximum possible switch distance equals the number of heterozygous sites, summed over all individuals, minus the number of diploid individuals, because an individual with n >= 1 heterozygous sites allows at most n - 1 switches. The switch error rate is zero if all heterozygous sites are phased correctly.
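
The definition above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used for the evaluation on this page. The function names (switch_counts, switch_error_rate) and the encoding of haplotypes as strings of '0' and '1' are assumptions made for the example. It also assumes that the inferred genotypes agree with the true ones and that there are no missing data. Individuals without heterozygous sites contribute nothing to the maximum.

```python
def switch_counts(true_pair, inferred_pair):
    """Return (switches, max_switches) for one diploid individual.

    Each pair holds two haplotypes as equal-length strings of '0'/'1'
    (a hypothetical encoding chosen for this sketch).
    """
    t1, t2 = true_pair
    i1, _ = inferred_pair
    # Heterozygous sites are the positions where the true haplotypes differ.
    het = [k for k in range(len(t1)) if t1[k] != t2[k]]
    # Orientation at each heterozygous site: does the first inferred
    # haplotype carry the allele of the first true haplotype?
    orient = [i1[k] == t1[k] for k in het]
    # A switch is needed wherever the orientation changes between
    # two consecutive heterozygous sites.
    switches = sum(a != b for a, b in zip(orient, orient[1:]))
    return switches, max(len(het) - 1, 0)


def switch_error_rate(true_pairs, inferred_pairs):
    """Total switch distance divided by the maximum possible switch distance."""
    total = 0
    maximum = 0
    for true_pair, inferred_pair in zip(true_pairs, inferred_pairs):
        s, m = switch_counts(true_pair, inferred_pair)
        total += s
        maximum += m
    return total / maximum if maximum else 0.0


true_pairs = [("0101", "1010"), ("0011", "0001")]
inferred_pairs = [("0110", "1001"), ("0001", "0011")]
rate = switch_error_rate(true_pairs, inferred_pairs)
```

In this toy input the two individuals have 4 and 1 heterozygous sites, so the maximum switch distance is 5 - 2 = 3, and one switch is needed in the first individual.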


Evaluation

First of all, the accuracy of a program is affected by its command-line options. Usually, the more iterations and hidden states are used, the more accurate the results. The price is running time: fastPHASE's running time is linear in -T and -C; MACH's running time is linear in --rounds and quadratic in --states. Without carefully studying the underlying algorithms, I am not sure how running time scales with the other command-line options.
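
To get a feel for what these scaling rules imply, here is a rough back-of-the-envelope sketch. It assumes that running time is simply proportional to T * C for fastPHASE and to rounds * states^2 for MACH, which ignores constant overheads and every other option. The function names and the baseline defaults (taken from the option values used in this experiment) are chosen for illustration only.

```python
def fastphase_relative_cost(T, C, base_T=10, base_C=25):
    """Running time relative to the baseline, assuming time ~ T * C."""
    return (T / base_T) * (C / base_C)


def mach_relative_cost(rounds, states, base_rounds=50, base_states=200):
    """Running time relative to the baseline, assuming time ~ rounds * states**2."""
    return (rounds / base_rounds) * (states / base_states) ** 2


# Doubling both options: about 4x for fastPHASE, about 8x for MACH.
fastphase_factor = fastphase_relative_cost(T=20, C=50)
mach_factor = mach_relative_cost(rounds=100, states=400)
```

Under these assumptions, doubling the number of hidden states in MACH costs as much as quadrupling the number of rounds.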

Program    Version  CPU time  Error  Additional Options
BEAGLE     3.0.2    39        0.092  nsamples=25
fastPHASE  1.4.0    400       0.086  -K10 -T10 -C25
MACH       1.0.16   1788      0.066  --rounds 50 --states 200

(To be continued...)