Background

UCSC genome browser is one of the three most widely used genome databases in the world. It is convenient and consistent in design and very efficient. UCSC data are usually retrieved in two ways: from the Genome Browser where you get graphical view of a specified genomic region, or from Tables which facilitates data retrieval in batch. However, for a programmer like me, a thrid way is missing: data retrieval via APIs or command-line tools. I guess people think Ensembl is convenient mainly because it provides APIs to connect to the public Ensembl MySQL database.

In fact, UCSC also has a public MySQL database (genomep:password@genome-mysql.cse.ucsc.edu:3306), but it does not provide Perl or Java APIs. Here I will show how to connect to the UCSC MySQL database using Perl.


Perl script for connecting UCSC MySQL

batchUCSC.pl

The perl script batchUCSC.pl is a general script to retrieve data in a region. You should refer to the Tables page for the description of each table. The script also predefines several rules to retrieve data. For example, the following command retrieves all the SNPs contained in the region `chr1:1-1000000': which is equivalent to:

A note on binning

For small tables like refGene, UCSC simply creates MySQL indices on the chromosomal coordinate of a record. For large tables like snp129, UCSC cacluates a bin number for each record and only indexes chromosome name and the bin number without indexing chromosomal coordinates. Doing in this way is faster than simply relying on the MySQL indices and saves disk space as well. Detailed algorithm can be found in Kent et al. (2002) or from the SAM format specification. You can also read the source code of this script to see how the bin number is calculated and used in query.