The following gives the procedure to generate the mask for single-end
reads of length `k' and stringency `r'. Here we take
k=35 and r=0.5. All the source codes are
available here,
released under the MIT/X11 license.
Extract all overlapping k-mer subsequences as read
sequences. Command:
Align all reads to the genome with BWA. Other aligners would work
if they do global alignment w.r.t. reads and give suboptimal
hits. The preferred command-line for aln is (NOT tested,
though):
The default command-line would also work, but in that configuration,
you can only get the approximate number of 1-mismatch hits rather
than 1-difference hits; in addition, suboptimal hits would not be
counted if there are more than 30 perfect hits (for BWA-0.4.9).
Suppose the unsorted BWA alignment results
are xx??.sam.gz. Generate rawMask with:
When you read a sequence in rawMask_35.fa into a C
array seq[], you can get the approximate number of perfect
hits (c1) and the approximate number of 1-mismatch hits
(c2) for the k-mer at [x,x+k-1] with the
following C code (NB: 63 to bypass the `>' symbol which may cause
problem in a FASTA file):
Note that c1 may be zero in long `N' regions; if
you use the default configuration of BWA-0.4.9, c2 is
always zero for c1>30.
Generate the final mask:
The format of this file is described in the following section.
Using the Mask
Acquiring the mask files
For the moment, the 35-mer rawMask file for the human genome is
available here
and the (35,0.5)
mask here. You
can easily generate mask for different r from the rawMask
with my programs.
File format
File mask_35_50.fa is a fasta-like file with sequences
composed of 0, 1, 2 and 3. Given a character c at
position x:
c=3: the majortiy of overlapping 35-mers are mapped
uniquely and without 1-mismatch (or 1-difference, depending on the
BWA command line) hits.
c=2: the majority of overlapping 35-mers are unique and c!=3.
c=1: the majority of overlapping 35-mers are non-unique.
c=0: all the 35-mers overlapping x cannot be
mapped due to excessive ambiguous bases.
Applying the mask
Given the genome file genome.fa, change all bases
corresponding to c!=3 to lowercases.
Given a list of sites with the first two columns describing 1-based
chromosomal positions, filter out all sites corresponding
to c!=3: