UCSC Genome Bioinformatics provides sql tables and fasta files for many genomes. pyucsc provides a lightweight python interface to these resources that can provide an SQL interface (using SqlAlchemy ) and fast access to the DNA sequences. It ties these together with a set of model objects that are loaded from the database and can be directly interrogated for the DNA sequence. All genomic intervals are described using fastinterval which provides convenient interval operations and fast sequence loading via pyfasta.
Choosing a genome¶
We first need to create an SqlAlchemy Session and fastinterval Genome. This requires that you have configured the database server and directory to use (see Configuration):
>>> import ucsc >>> session, genome = ucsc.use('hg19')
DNA intervals and Fasta files¶
The core of the interface to the sequence is the fastinterval.Interval class. All genomic locations are described with this class:
>>> i1 = genome.Interval(10182300,10182320, chrom='chr3') >>> i1.sequence 'agctcactgcaacctccgcc:' >>> str(i1) 'chr3:10182300-10182320:'
Strands are handled and reverse complements generated correctly
>>> i2 = genome.Interval(10182300, 10182320, chrom='chr3', strand=-1) >>> i2.sequence 'ggcggaggttgcagtgagct'
Interval logic is available as methods on the Interval class
>>> i3 = Interval(10182310,10182330, chrom='chr3') >>> i1.overlaps(i3) True >>> i1.contains(i3) False >>> i1.intersection(i3) Interval(10182310, 10182320) >>> i1.span(i3) Interval(10182300, 10182330) >>> i1.union(i3) Interval(10182300, 10182330)
For full details, see the fastinterval documentation.
pyucsc uses SqlAlchemy to expose the ucsc database tables:
>>> from ucsc import tables
e.g. to count the entries in knownGene:
>>> tables.knownGene.count().execute().scalar() 77614L
or to get genes:
>>> tables.knownGene.select().limit(2).execute().fetchall() [('uc001aaa.3', 'chr1', '+', ...), ('uc010nxq.1', 'chr1', '+', ...)]
See the SqlAlchemy SQL core tutorial for more information.
The foreign key relationships are not defined by UCSC in the schema. This means that, for the moment, you must manually specify conditions when constructing a join.
Alternatively, we can use the pyucsc model objects which provide a more natural python interface to the database tables, with convenience methods such as creating appropriate intervals:
>>> from ucsc import model >>> vhl = session.query(model.KnownGene).filter(model.KnownGene.geneSymbol=='VHL').one() >>> vhl KnownGene(VHL, uc003bvc.2, chr3:10183318-10193744)
To get the transcript:
>>> vhl.transcript Interval(10183318, 10193744) >>> vhl.transcript.sequence 'CCTCGCCTCCGTTACAACGGCCTACGGTGCTG...'
>>> model.Snp.for_interval(vhl.transcript) [SNP(rs779805, chr3:10183336-10183337), SNP(rs34271731, chr3:10183434-10183435), ... ]
For the full model documentation, see Table Models.
You need to provide a local copy of the Fasta files and the database server to use. You can use the public UCSC mysql server, but please respect their usage policy. Configuration is achieved either through a configuration file, or by setting variables in ucsc.config.
Via Configuration Files
Create a YAML file in either /etc/pyucsc or ~/.pyucsc with two entries:
fasta_dir: /fasta database_uri: mysql://genome:@datarig.local/
from ucsc import config config.fasta_dir = "/fasta" config.database_uri = "mysql://genome:@datarig.local/"
The table interfaces are generated by introspection and therefore complete. The model interface only covers a limited set of tables, but it is easy to add new classes and mappings.