views:

199

answers:

1

How can I extract DNA sequence using a Perl script from genome browser (UCSC), if I have their coordinates?

+5  A: 

You can pipe a DAS sequence request into a Perl script that parses out the XML element containing the sequence.

For example, the following is a curl request of UCSC's DAS server, throwing away the standard error, piped to parseSeq.pl:

$ curl http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=1:10000,10999 2>/dev/null | parseSeq.pl

The output of curl will be an XML document containing the 1000-base DNA sequence from the hg19 assembly of the human genome. The request asks for base 10000 to 10999 (remember that UCSC is 0-based) from the first chromosome. The XML will include some other stuff useful for logging and error checking.

After piping XML into a Perl script, you can use Perl's XML::Simple module to quickly parse out the stuff you want.

To help you get started, your parseSeq.pl file might start with:

#!/usr/bin/perl -w                                                                                                                                                                                                                          

use strict;                                                                                                                                                                                                                                 
use XML::Simple;                                                                                                                                                                                                                            
use Data::Dumper;                                                                                                                                                                                                                           

my $xml = new XML::Simple;                                                                                                                                                                                                                  
my $ref = $xml->XMLin('-');                                                                                                                                                                                                                       

print Dumper $ref;

The output of this should give you enough of a start to pull the DNA sequence from $ref.

Alex Reynolds
Thank you very much!
fed