SNPable Regions

(Unfinished page!)

Procedure

The following procedure generates the mask for single-end reads of length `k' and stringency `r'. Here we take k=35 and r=0.5. All the source code is available here, released under the MIT/X11 license.

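Extract all overlapping k-mer subsequences as read sequences. Command:
    splitfa genome.fa 35 | split -l 20000000
Note that split names its output files xaa, xab, ... by default; to get the xx?? names assumed in the following steps, give an explicit prefix (e.g. split -l 20000000 - xx).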
Align all reads to the genome with BWA. Other aligners would also work, provided they do global alignment with respect to the reads and report suboptimal hits. The preferred command line for aln is (NOT tested, though):
    bwa aln -R 1000000 -O 3 -E 3 genome.index xxaa > xxaa.sai
The default command line would also work, but in that configuration you only get the approximate number of 1-mismatch hits rather than 1-difference hits; in addition, suboptimal hits are not counted if there are more than 30 perfect hits (for BWA-0.4.9).
Suppose the unsorted BWA alignment results are xx??.sam.gz. Generate rawMask with:
    gzip -dc xx??.sam.gz | gen_raw_mask.pl > rawMask_35.fa
When you read a sequence in rawMask_35.fa into a C array seq[], you can get the approximate number of perfect hits (c1) and the approximate number of 1-mismatch hits (c2) for the k-mer at [x,x+k-1] with the following C code (NB: the offset of 63 bypasses the `>' symbol, which may cause problems in a FASTA file):
    int c1, c2, c = seq[x] - 63;
    c1 = c >> 3; c2 = c & 7;
    c1 = c1? 1<<(c1-1) : 0;
    c2 = c2? 1<<(c2-1) : 0;
Note that c1 may be zero in long `N' regions; if you use the default configuration of BWA-0.4.9, c2 is always zero for c1>30.
Generate the final mask:
    gen_mask -l 35 -r 0.5 rawMask_35.fa > mask_35_50.fa
The format of this file is described in the following section.

Using the Mask

Acquiring the mask files

For the moment, the 35-mer rawMask file for the human genome is available here and the (35,0.5) mask here. You can easily generate a mask for a different r from the rawMask with my programs.

File format

File mask_35_50.fa is a FASTA-like file whose sequences are composed of the characters 0, 1, 2 and 3. Given a character c at position x:

c=3: the majority of overlapping 35-mers are mapped uniquely and without 1-mismatch (or 1-difference, depending on the BWA command line) hits.
c=2: the majority of overlapping 35-mers are unique and c!=3.
c=1: the majority of overlapping 35-mers are non-unique.
c=0: all the 35-mers overlapping x cannot be mapped due to excessive ambiguous bases.

Applying the mask

Given the genome file genome.fa, change all bases corresponding to c!=3 to lowercase:
    apply_mask_s mask_35_50.fa genome.fa > genome.mask.fa
Given a list of sites with the first two columns describing 1-based chromosomal positions, filter out all sites corresponding to c!=3:
    apply_mask_l mask_35_50.fa in.list > out.list

SNPable Regions in the Human Genome

35bp single-end, default BWA configuration:

-r	2-away	1-away	HapMap2
0.1	86.0%	91.6%
0.2	84.8%	90.6%
0.3	83.2%	89.3%
0.4	81.9%	88.3%
0.5	80.3%	86.8%	98.6%
0.6	79.0%	85.6%
0.7	77.1%	83.7%
0.8	75.6%	82.2%	96.5%
0.9	73.1%	79.7%
1.0	70.7%	77.4%