UCSC genome browser is one of the three most widely used genome databases in the world. It is convenient and consistent in design and very efficient. UCSC data are usually retrieved in two ways: from the Genome Browser where you get graphical view of a specified genomic region, or from Tables which facilitates data retrieval in batch. However, for a programmer like me, a thrid way is missing: data retrieval via APIs or command-line tools. I guess people think Ensembl is convenient mainly because it provides APIs to connect to the public Ensembl MySQL database.
In fact, UCSC also has a public MySQL database (genomep:email@example.com:3306), but it does not provide Perl or Java APIs. Here I will show how to connect to the UCSC MySQL database using Perl.
The perl script batchUCSC.pl is a general script to retrieve data in a region. You should refer to the Tables page for the description of each table. The script also predefines several rules to retrieve data. For example, the following command retrieves all the SNPs contained in the region `chr1:1-1000000': which is equivalent to:
For small tables like refGene, UCSC simply creates MySQL indices on
the chromosomal coordinate of a record. For large tables like snp129,
UCSC cacluates a bin number for each record and only indexes
chromosome name and the bin number without indexing chromosomal
coordinates. Doing in this way is faster than simply relying on the
MySQL indices and saves disk space as well. Detailed algorithm can be
found in Kent