views:

60

answers:

2

How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fetch_seq(chrom, strand, start, end) which will return the sequence [start, end] on the given chromosome on the specified strand.

Analogously, is there a programmatic python interface for getting phastCons scores?

thanks.

+1  A: 

Take a look at biopython, which has support for several gene sequence formats. Specifically, it has support for FASTA and GenBank files, to name a couple.

slacy
It does, but I can see it only supports reading of records from FASTA, not fetching of sequences... if you wanted to fetch a sequence (start, end) from FASTA, you'd need a format with no new lines and the right interface and I don't think BioPython supports that. Maybe I missed something though - could you point to the relevant doc on this? Thank you!
+1  A: 

See my answer to your question over at Biostar:

http://biostar.stackexchange.com/questions/1639/getting-genomic-sequences-and-phastcons-scores-using-python-from-ensembl-ucsc

Use SeqIO with Fasta files and you'll get back record objects for each item in the file. Then you can do:

region = rec.seq[start:end]

to pull out slices. The nice thing about using a standard library is you don't have to worry about the line breaks in the original fasta file.

Brad Chapman
I agree that this approach is really elegant because you have a standard library to use but I found it to be very slow. If you assume a fasta file with no newlines, you can simply "seek" the coordinates in the file which I think is much faster, and it doesn't require you to load all the fasta files from every chromosome into memory. Is there a way to achieve the same kind of efficiency with a standard library like biopython? thanks.
It's not exactly clear what you are looking for, but I agree that a custom solution tailored to your specific files will be faster than a more general solution. In practice most FASTA files have line breaks and what not, so I prefer being general but your experience may vary.
Brad Chapman