Introduction
When studying human genetics, we all use the human reference genome. However, not everyone understands what this genome contains and how to use it effectively for different purposes. This article aims to provide a practical guide to the human reference genome.
Understanding GRCh37
The human reference genome sequence is maintained by the Genome Reference Consortium (GRC). As of this writing, the latest major release is GRCh37. We also frequently talk about hg19; it is derived from GRCh37 but is not the official GRC release.
The following nested list gives the composition of GRCh37. My explanations are simple but imprecise; please refer to the GRC assembly terminology page for precise definitions. Below, quoted sentences are from that web page.
- Primary assembly: the best known assembly of a haploid genome.
  - Chromosome assembly: A sequence with known physical location (e.g. according to a physical map).
  - Unlocalized sequence: "A sequence found in an assembly that is associated with a specific chromosome but cannot be ordered or oriented on that chromosome".
  - Unplaced sequence: "A sequence found in an assembly that is not associated with any chromosome".
- Alternate loci: "A sequence that provides an alternate representation of a locus found in a largely haploid assembly". Alternate loci are mostly sequences in regions known to be highly polymorphic. GRCh37 contains 9 alternate sequences from the following 3 regions:
  - 6:28,477,796-33,448,353 -- the MHC region
  - 17:43,384,863-44,913,631 -- long inversion
  - 4:69,170,076-69,878,206
- Patches: "A contig sequence that is released outside of the full assembly release cycle". Patches consist of FIX and NOVEL.
  - "FIX patches are released to correct an error in the assembly and will be removed when the new full assembly is released".
  - "NOVEL sequences are sequences that were not in the last full assembly release and will be retained with the next full assembly release".
It is worth noting that both alternate loci and patches contain long flanking sequences nearly identical to the primary assembly. They are therefore redundant with respect to the primary assembly.
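To get a feel for how much of the primary assembly the alternate loci cover, the three regions listed above can be parsed and measured. The following is a minimal sketch, not part of any GRC tool: the helper names `parse_region` and `region_length` are my own, and I am assuming the coordinates are 1-based and inclusive at both ends, which is the usual reading of this notation.

```python
import re

# Regions copied from the list above, with their notes (if any).
ALT_LOCUS_REGIONS = [
    ("6:28,477,796-33,448,353", "the MHC region"),
    ("17:43,384,863-44,913,631", "long inversion"),
    ("4:69,170,076-69,878,206", ""),
]

# chromosome name, then start and end with optional thousands separators
REGION_RE = re.compile(r"^(\w+):([\d,]+)-([\d,]+)$")


def parse_region(text):
    """Parse 'chrom:start-end' (commas allowed) into (chrom, start, end)."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return chrom, start, end


def region_length(start, end):
    """Length in bases, ASSUMING 1-based coordinates inclusive at both ends."""
    return end - start + 1


for region, note in ALT_LOCUS_REGIONS:
    chrom, start, end = parse_region(region)
    print(f"chr{chrom}\t{region_length(start, end):,} bp\t{note}")
```

Under that assumption the three regions span roughly 5.0 Mb, 1.5 Mb and 0.7 Mb of the primary assembly. These are the sizes of the regions on the primary assembly, not the lengths of the alternate sequences themselves.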
Constructing the reference genome for read mapping
For screening human contamination, for example in a metagenomics project, it is preferable to use the entire GRCh37, including the primary assembly, alternate loci and patches.
For variant discovery, RNA-seq and ChIP-seq, it is recommended to map reads to the entire primary assembly, including the assembled chromosomes AND the unlocalized/unplaced contigs. Leaving out the unlocalized and unplaced contigs potentially leads to more mapping errors. To avoid naming and chromosome ordering issues, I HIGHLY recommend using the version of the reference sequence used by the 1000 Genomes Project (here is the README). This version replaces the mitochondrial sequence with the revised Cambridge Reference Sequence (rCRS; AC:NC_012920), the most widely used version among mitochondria specialists.
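If you start from a FASTA file that mixes the primary assembly with alternate loci and patches, one way to build a mapping reference is to keep only the records you want by name. The sketch below is only an illustration: `filter_fasta` is a hypothetical helper, and the sequence names in the toy input are made up. Real names depend on where you downloaded the file, so you would need to build the list of names to keep from that source's documentation.

```python
import io


def filter_fasta(in_handle, out_handle, keep_names):
    """Copy only the FASTA records whose name is in keep_names.

    The record name is taken to be the first whitespace-delimited token
    after '>'. Returns the names written, in file order.
    """
    written = []
    keep = False
    for line in in_handle:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = name in keep_names
            if keep:
                written.append(name)
        if keep:
            out_handle.write(line)
    return written


# Toy input with MADE-UP names standing in for the four kinds of sequence.
toy_fasta = (
    ">chrA assembled chromosome\nACGT\nACGT\n"
    ">unplaced_1 unplaced contig\nGGCC\n"
    ">alt_locus_1 alternate locus\nTTAA\n"
    ">patch_1 FIX patch\nCCGG\n"
)

# For variant discovery, keep the primary assembly only:
# assembled chromosomes plus unlocalized/unplaced contigs.
primary_names = {"chrA", "unplaced_1"}

out = io.StringIO()
written = filter_fasta(io.StringIO(toy_fasta), out, primary_names)
print(written)
print(out.getvalue(), end="")
```

For contamination screening you would skip the filtering step and use the whole file. With real data, pass open file handles instead of the `io.StringIO` objects used here.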