Phasing for Unrelated Individuals

PHASE is the gold standard for phasing genotype data and has been shown to be the most accurate program for this task. However, PHASE is rather slow when applied to hundreds of markers across hundreds of individuals, which has led to the development of several newer phasing programs, including BACH, BEAGLE, fastPHASE, HaploRec, HIT and MACH. How do they perform? This page, which is largely inspired by Rastas et al. (2008), gives a rough idea.

Coalescent simulation

Coalescent simulation was done with MaCS, using a command line adapted from Plagnol and Wall (2006). 100 European and 100 African haplotypes of length 100 kbp were generated, making up 100 diploid individuals in total. All segregating sites were retained, and no missing genotypes were introduced. Individuals from the two populations were mixed together and fed to the phasing programs as input. The MaCS command line is:

It is important to note that the performance of a phasing program may vary considerably with the type of input data. I do not mean to provide a comprehensive benchmark across conditions here, only a simple view of how these programs differ from each other in a single experiment. That experiment could be made more realistic if:

European and African populations were not mixed.
A better demographic model was used.
Not all segregating sites were included.
Not all genotypes were present.
Performance breakdown was shown given different sample sizes, LD, etc.

Fast Phasing Programs

BACH. BACH seems a bit slow on large data sets.
BEAGLE. BEAGLE compresses the information in the data with a variable-length HMM. Evaluated.
fastPHASE. Based on the Li and Stephens (2003) model. Evaluated.
HaploRec. I tried it but could not get a sensible result. The fault probably lies in how I prepared the input, but either way I could not evaluate it.
HIT. Not evaluated, for the same reason as HaploRec.
MACH. Based on the Li and Stephens model. Evaluated.
PHASE. Based on the Li and Stephens model. Version 2.1 cannot handle "more than approximately 100 markers at once" according to Browning (2008), so it is not evaluated here.

Measurement

The most widely used measure of phasing accuracy is the switch error rate (Stephens and Donnelly, 2003; Marchini et al., 2006). It equals the switch distance between the true phased sequences and the inferred phased sequences, divided by the maximum possible switch distance. Given two pairs of phased sequences, the switch distance between them is the minimum number of switches required to turn one pair into the other. An individual with h heterozygous sites can have at most h - 1 switches, so the maximum possible switch distance equals the number of heterozygotes minus the number of diploid individuals (assuming every individual has at least one heterozygous site). The switch error rate is zero if all the heterozygotes are phased correctly.
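
The definition above can be sketched in a few lines of Python. This is only an illustration, not the script used for the evaluation on this page: the function names and the 0/1 allele encoding are assumptions made for the example. At each heterozygous site the code records which true haplotype the first inferred haplotype agrees with, and counts a switch whenever that orientation changes between consecutive heterozygous sites.

```python
from typing import List, Sequence, Tuple

Haplotype = Sequence[int]            # alleles coded as 0/1 (an assumption of this sketch)
Pair = Tuple[Haplotype, Haplotype]   # the two haplotypes of one diploid individual


def switch_counts(true_pair: Pair, inferred_pair: Pair) -> Tuple[int, int]:
    """Return (switches, maximum possible switches) for one individual."""
    t1, t2 = true_pair
    i1, i2 = inferred_pair
    if not (len(t1) == len(t2) == len(i1) == len(i2)):
        raise ValueError("all haplotypes must have the same length")

    orientations: List[bool] = []
    for a, b, x, y in zip(t1, t2, i1, i2):
        if a == b:
            continue  # homozygous sites carry no phase information
        if {a, b} != {x, y}:
            raise ValueError("inferred genotype differs from true genotype")
        # True if the first inferred haplotype follows the first true haplotype here
        orientations.append(x == a)

    # A switch is a change of orientation between consecutive heterozygous sites
    switches = sum(1 for p, q in zip(orientations, orientations[1:]) if p != q)
    return switches, max(len(orientations) - 1, 0)


def switch_error_rate(true_pairs: Sequence[Pair], inferred_pairs: Sequence[Pair]) -> float:
    """Total switches over all individuals divided by the maximum possible total."""
    total_switches = 0
    total_max = 0
    for true_pair, inferred_pair in zip(true_pairs, inferred_pairs):
        s, m = switch_counts(true_pair, inferred_pair)
        total_switches += s
        total_max += m
    return total_switches / total_max if total_max else 0.0


# Toy data: individual 1 has one switch among three heterozygous sites;
# individual 2 is phased correctly (the haplotypes are merely listed in swapped order).
true_pairs = [([0, 1, 0, 1, 1], [1, 0, 0, 0, 1]), ([0, 0, 1], [1, 1, 0])]
inferred_pairs = [([0, 1, 0, 0, 1], [1, 0, 0, 1, 1]), ([1, 1, 0], [0, 0, 1])]
print(switch_error_rate(true_pairs, inferred_pairs))
```

Note that swapping the order of the two haplotypes within an individual does not count as an error, because only changes of orientation between neighbouring heterozygous sites are counted.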

Evaluation

First of all, the accuracy of a program is affected by its command-line options. Usually, the more iterations and hidden states are used, the more accurate the results are. Specifically, fastPHASE's running time is linear in -T and -C; MACH's running time is linear in --rounds and quadratic in --states. I am not sure how running time scales with the other command-line options without carefully studying the underlying algorithms.

Program	Version	CPU time	Error	Additional Options
BEAGLE	3.0.2	39	0.092	nsamples=25
fastPHASE	1.4.0	400	0.086	-K10 -T10 -C25
MACH	1.0.16	1788	0.066	--rounds 50 --states 200
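
To make the scaling of fastPHASE and MACH concrete, here is a rough sketch that estimates running time relative to the settings in the table, using only the relations stated above (linear in -T, -C and --rounds, quadratic in --states). The function names are hypothetical, and the estimates ignore every other option and any constant overhead, so treat them as back-of-the-envelope figures rather than predictions.

```python
def fastphase_relative_cost(T: int, C: int, base_T: int = 10, base_C: int = 25) -> float:
    """Estimated running time relative to the table settings (-T10 -C25).

    Assumes running time is linear in both -T and -C, as stated above.
    """
    return (T / base_T) * (C / base_C)


def mach_relative_cost(rounds: int, states: int,
                       base_rounds: int = 50, base_states: int = 200) -> float:
    """Estimated running time relative to the table settings (--rounds 50 --states 200).

    Assumes running time is linear in --rounds and quadratic in --states.
    """
    return (rounds / base_rounds) * (states / base_states) ** 2


# Doubling --states is expected to cost about four times as much,
# whereas doubling --rounds only doubles the running time.
print(mach_relative_cost(rounds=50, states=400))
print(mach_relative_cost(rounds=100, states=200))
print(fastphase_relative_cost(T=20, C=25))
```

Under these assumptions, halving --states in MACH would save more time than halving --rounds, although the effect on accuracy would have to be measured separately.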

(To be continued...)