How can I extract DNA sequence using a Perl script from genome browser (UCSC), if I have their coordinates?
views:
199answers:
1You can pipe a DAS sequence request into a Perl script that parses out the XML element containing the sequence.
For example, the following is a curl
request of UCSC's DAS server, throwing away the standard error, piped to parseSeq.pl
:
$ curl http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=1:10000,10999 2>/dev/null | parseSeq.pl
The output of curl
will be an XML document containing the 1000-base DNA sequence from the hg19
assembly of the human genome. The request asks for base 10000 to 10999 (remember that UCSC is 0-based) from the first chromosome. The XML will include some other stuff useful for logging and error checking.
After piping XML into a Perl script, you can use Perl's XML::Simple module to quickly parse out the stuff you want.
To help you get started, your parseSeq.pl
file might start with:
#!/usr/bin/perl -w
use strict;
use XML::Simple;
use Data::Dumper;
my $xml = new XML::Simple;
my $ref = $xml->XMLin('-');
print Dumper $ref;
The output of this should give you enough of a start to pull the DNA sequence from $ref
.