(Unfinished page!)


Procedure

The following procedure generates the mask for single-end reads of length `k' and stringency `r'. Here we take k=35 and r=0.5. All the source code is available here, released under the MIT/X11 license.


Using the Mask

Acquiring the mask files

For the moment, the 35-mer rawMask file for the human genome is available here and the (35,0.5) mask here. You can easily generate masks for a different r from the rawMask with my programs.

File format

File mask_35_50.fa is a fasta-like file with sequences composed of 0, 1, 2 and 3. Given a character c at position x:

Applying the mask

Given the genome file genome.fa, change all bases corresponding to c!=3 to lowercase. Given a list of sites whose first two columns give 1-based chromosomal positions (chromosome name and coordinate), filter out all sites corresponding to c!=3.
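The two operations above can be sketched in a few lines of Python. This is only an illustration, not the programs distributed with the mask: the function names read_fasta, soft_mask and filter_sites are hypothetical, and the sketch assumes that the mask and the genome use the same sequence names and lengths, that the site list is whitespace-separated with the chromosome name in column 1 and the 1-based position in column 2, and that sites on sequences absent from the mask are dropped.

```python
import io


def read_fasta(handle):
    """Parse a FASTA-like stream into a dict {sequence name: sequence string}."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def soft_mask(seq, mask):
    """Lowercase every base whose mask character is not '3'; leave the rest as-is."""
    if len(seq) != len(mask):
        raise ValueError("sequence and mask lengths differ")
    return "".join(b if c == "3" else b.lower() for b, c in zip(seq, mask))


def filter_sites(lines, masks):
    """Yield the site lines whose (chromosome, 1-based position) has mask '3'."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or line.startswith("#"):
            continue  # skip blank, malformed or comment lines
        chrom, pos = fields[0], int(fields[1])
        mask = masks.get(chrom)
        # Assumption: unknown chromosomes and out-of-range positions are dropped.
        if mask is not None and 1 <= pos <= len(mask) and mask[pos - 1] == "3":
            yield line


if __name__ == "__main__":
    # Tiny in-memory stand-ins for genome.fa, mask_35_50.fa and a site list.
    genome = read_fasta(io.StringIO(">chr1\nACGTAC\nGT\n"))
    masks = read_fasta(io.StringIO(">chr1\n33012\n333\n"))
    for name, seq in genome.items():
        print(">" + name)
        print(soft_mask(seq, masks[name]))
    for site in filter_sites(["chr1\t1\tA", "chr1\t3\tG", "chr1\t8\tT"], masks):
        print(site)
```

For a whole human genome, holding every sequence in memory as a Python string is workable but not frugal; streaming one chromosome at a time would be the natural next step.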


SNPable Regions in the Human Genome

35bp single-end, default BWA configuration:

-r     2-away   1-away   HapMap2
0.1    86.0%    91.6%
0.2    84.8%    90.6%
0.3    83.2%    89.3%
0.4    81.9%    88.3%
0.5    80.3%    86.8%    98.6%
0.6    79.0%    85.6%
0.7    77.1%    83.7%
0.8    75.6%    82.2%    96.5%
0.9    73.1%    79.7%
1.0    70.7%    77.4%