When studying human genetics, we all use the human reference genome. However, not everyone understands what this genome contains and how to use it effectively for different purposes. This article aims to provide a practical guide to the human reference genome.

Understanding GRCh37

The human reference genome sequence is maintained by the Genome Reference Consortium (GRC). As of now, the latest major release is GRCh37. We also frequently talk about hg19, but hg19 is the UCSC version derived from GRCh37, not the official GRC release.

The following nested list gives the composition of GRCh37. My explanations are simplified and not precise. Please refer to the GRC assembly terminology page for precise definitions. Below, quoted sentences are from that web page.

It is worth noting that both alternate loci and patches contain long flanking sequences nearly identical to the primary assembly. In other words, they are redundant with respect to the primary assembly.

Constructing the reference genome for read mapping

For screening human contamination, for example in a metagenomics project, it is preferable to use the entire GRCh37, including the primary assembly, alternate loci and patches.

For variant discovery, RNA-seq and ChIP-seq, it is recommended to map reads to the entire primary assembly, including assembled chromosomes AND unlocalized/unplaced contigs. Leaving out the unlocalized and unplaced contigs potentially leads to more mapping errors. To avoid naming and chromosome ordering issues, I HIGHLY recommend using the version of the reference sequence used by the 1000 Genomes Project (here is the README). This version replaces the mitochondrial sequence with the revised Cambridge Reference Sequence (rCRS; AC:NC_012920), the most widely used version among mitochondria specialists.
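If you start from a FASTA file that mixes the primary assembly with other sequences, the selection described above can be sketched in a few lines of Python. This is only an illustration, not an official tool: the function names filter_fasta and is_primary are made up for this example, and the name pattern ASSUMES the 1000 Genomes style of naming (1 to 22, X, Y, MT, plus GL contigs such as GL000191.1). Check the sequence names in your own file and adjust the pattern before relying on it.

```python
import io
import re

# ASSUMPTION: sequence names follow the 1000 Genomes style:
# 1..22, X, Y, MT for chromosomes and GLnnnnnn.n for
# unlocalized/unplaced contigs. Adjust for other naming schemes.
PRIMARY_NAME = re.compile(r"^(?:[1-9]|1[0-9]|2[0-2]|X|Y|MT|GL\d{6}\.\d+)$")


def is_primary(name):
    """Return True if the name looks like a primary-assembly sequence."""
    return PRIMARY_NAME.match(name) is not None


def filter_fasta(src, dst, keep=is_primary):
    """Copy FASTA records from src to dst when keep(name) is true.

    src and dst are text file objects. The record name is the first
    whitespace-delimited word after '>'. Returns the kept names in
    the order they appear in src.
    """
    kept = []
    writing = False
    for line in src:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            writing = keep(name)
            if writing:
                kept.append(name)
        if writing:
            dst.write(line)
    return kept


if __name__ == "__main__":
    # Tiny in-memory demo; "some_other_seq" is a placeholder name.
    demo = io.StringIO(">1\nACGT\n>GL000191.1\nNNNN\n>some_other_seq\nTTTT\n")
    out = io.StringIO()
    print(filter_fasta(demo, out))
```

The filter preserves the input order of the records, so it does not by itself fix chromosome ordering. Using the 1000 Genomes file directly remains the simpler route.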