The following gives the procedure to generate the mask for single-end
reads of length `k' and stringency `r'. Here we take
k=35 and r=0.5. All the source codes are
available here,
released under the MIT/X11 license.
Extract all overlapping k-mer subsequences as read
sequences. Command:
splitfa genome.fa 35 | split -l 20000000
Align all reads to the genome with BWA. Other aligners would work
if they do global alignment w.r.t. reads and give suboptimal
hits. The preferred command-line for aln is (NOT tested,
though):
The default command-line would also work, but in that configuration,
you can only get the approximate number of 1-mismatch hits rather
than 1-difference hits; in addition, suboptimal hits would not be
counted if there are more than 30 perfect hits (for BWA-0.4.9).
Suppose the unsorted BWA alignment results
are xx??.sam.gz. Generate rawMask with:
When you read a sequence in rawMask_35.fa into a C
array seq[], you can get the approximate number of perfect
hits (c1) and the approximate number of 1-mismatch hits
(c2) for the k-mer at [x,x+k-1] with the
following C code (NB: 63 to bypass the `>' symbol which may cause
problem in a FASTA file):
int c1, c2, c = seq[x] - 63;
c1 = c >> 3; c2 = c & 7;
c1 = c1? 1<<(c1-1) : 0; c2 = c2? 1<<(c2-1) : 0;
Note that c1 may be zero in long `N' regions; if
you use the default configuration of BWA-0.4.9, c2 is
always zero for c1>30.
The format of this file is described in the following section.
Using the Mask
Acquiring the mask files
For the moment, the 35-mer rawMask file for the human genome is
available here
and the (35,0.5)
mask here. You
can easily generate mask for different r from the rawMask
with my programs.
File format
File mask_35_50.fa is a fasta-like file with sequences
composed of 0, 1, 2 and 3. Given a character c at
position x:
c=3: the majortiy of overlapping 35-mers are mapped
uniquely and without 1-mismatch (or 1-difference, depending on the
BWA command line) hits.
c=2: the majority of overlapping 35-mers are unique and c!=3.
c=1: the majority of overlapping 35-mers are non-unique.
c=0: all the 35-mers overlapping x cannot be
mapped due to excessive ambiguous bases.
Applying the mask
Given the genome file genome.fa, change all bases
corresponding to c!=3 to lowercases.