Overview

UCSC Genome Bioinformatics provides sql tables and fasta files for many genomes. pyucsc provides a lightweight python interface to these resources that can provide an SQL interface (using SqlAlchemy ) and fast access to the DNA sequences. It ties these together with a set of model objects that are loaded from the database and can be directly interrogated for the DNA sequence. All genomic intervals are described using fastinterval which provides convenient interval operations and fast sequence loading via pyfasta.

Choosing a genome

We first need to create an SqlAlchemy Session and fastinterval Genome. This requires that you have configured the database server and directory to use (see Configuration):

>>> import ucsc
>>> session, genome = ucsc.use('hg19')

DNA intervals and Fasta files

The core of the interface to the sequence is the fastinterval.Interval class. All genomic locations are described with this class:

>>> i1 = genome.Interval(10182300,10182320, chrom='chr3')
>>> i1.sequence
'agctcactgcaacctccgcc:'
>>> str(i1)
'chr3:10182300-10182320:'

Strands are handled and reverse complements generated correctly

>>> i2 = genome.Interval(10182300, 10182320, chrom='chr3', strand=-1)
>>> i2.sequence
'ggcggaggttgcagtgagct'

Interval logic is available as methods on the Interval class

>>> i3 = Interval(10182310,10182330, chrom='chr3')
>>> i1.overlaps(i3)
True
>>> i1.contains(i3)
False
>>> i1.intersection(i3)
Interval(10182310, 10182320)
>>> i1.span(i3)
Interval(10182300, 10182330)
>>> i1.union(i3)
Interval(10182300, 10182330)

For full details, see the fastinterval documentation.

SQL tables

pyucsc uses SqlAlchemy to expose the ucsc database tables:

>>> from ucsc import tables

e.g. to count the entries in knownGene:

>>> tables.knownGene.count().execute().scalar()
77614L

or to get genes:

>>> tables.knownGene.select().limit(2).execute().fetchall()
[('uc001aaa.3', 'chr1', '+', ...),
 ('uc010nxq.1', 'chr1', '+', ...)]

See the SqlAlchemy SQL core tutorial for more information.

The foreign key relationships are not defined by UCSC in the schema. This means that, for the moment, you must manually specify conditions when constructing a join.

Model objects

Alternatively, we can use the pyucsc model objects which provide a more natural python interface to the database tables, with convenience methods such as creating appropriate intervals:

>>> from ucsc import model
>>> vhl = session.query(model.KnownGene).filter(model.KnownGene.geneSymbol=='VHL').one()
>>> vhl
KnownGene(VHL, uc003bvc.2, chr3:10183318-10193744)

To get the transcript:

>>> vhl.transcript
Interval(10183318, 10193744)
>>> vhl.transcript.sequence
'CCTCGCCTCCGTTACAACGGCCTACGGTGCTG...'

Snp queries:

>>> model.Snp.for_interval(vhl.transcript)
[SNP(rs779805, chr3:10183336-10183337),
 SNP(rs34271731, chr3:10183434-10183435),
 ... ]

For the full model documentation, see Table Models.

Configuration

You need to provide a local copy of the Fasta files and the database server to use. You can use the public UCSC mysql server, but please respect their usage policy. Configuration is achieved either through a configuration file, or by setting variables in ucsc.config.

Via Configuration Files

Create a YAML file in either /etc/pyucsc or ~/.pyucsc with two entries:

fasta_dir: /fasta
database_uri:  mysql://genome:@datarig.local/

Via code:

from ucsc import config
config.fasta_dir = "/fasta"
config.database_uri = "mysql://genome:@datarig.local/"

Status

The table interfaces are generated by introspection and therefore complete. The model interface only covers a limited set of tables, but it is easy to add new classes and mappings.

Development

Please use the github repository for issues and patches: https://github.com/PopulationGenetics/pyucsc