I'm trying to query a database, then convert the file-like object it returns to an XML document. Here's what I've been doing:

>>> import urllib, xml.dom.minidom
>>> query = "http://sbol.bhi.washington.edu/openrdf-sesame/repositories/sbol_test?query=select%20distinct%20%3Fname%20%3Ffeaturename%20where%20%7B%3Fpart%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23annotation%3E%20%3Fannotation%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23status%3E%20'Available'%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23name%3E%20%3Fname.%3Fannotation%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23feature%3E%20%3Ffeature.%3Ffeature%20%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23type%3E%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23binding%3E%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23name%3E%20%3Ffeaturename%7D"
>>> raw_result = urllib.urlopen(query)
>>> xml_result = xml.dom.minidom.parse(raw_result)

That last command gives me

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 4

Almost the same thing happens if I use xml.etree.ElementTree to do the parsing. I think they both use Expat. The weird part is that if, instead of loading the file in Python, I just paste the query into Firefox, the resulting file can be read perfectly well using open(path_to_file, "r").

Any ideas what this could be?

UPDATE: This is the first line of the file:

<?xml version='1.0' encoding='UTF-8'?> 

However, that may not be what's in raw_result; it's what you get after downloading query-result.srx and changing the extension to .txt. The file extension doesn't matter, does it? Also, I'm pretty new to this whole XML thing: why is column 4 the 8th character?

A: 

Any chance you could post the XML snippet? The parser is indicating that the error occurs on the very first line. My guess is that the formatting is off or is being reported incorrectly, which causes Expat to raise an exception right off the bat.

I suspect the first line violates one of the rules for well-formed XML. For reference, you might compare against http://en.wikipedia.org/wiki/XML

heckj
A: 

Looks like something is wrong with your XML file, right about line 1, column 4.

I tried this, and what I got doesn't look like XML to me. Here are the first eight characters, as Alex suggested:

>>> raw_result.read(8)
'BRTR\x00\x00\x00\x03'
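
A quick way to make this check repeatable is to test whether the leading bytes could start an XML document at all. This is only a rough sketch in Python 3 syntax; the helper name looks_like_xml is made up for illustration, and the check is a heuristic, not a validator.

```python
def looks_like_xml(head):
    """Return True if the leading bytes plausibly start an XML document."""
    # Skip a UTF-8 byte order mark, if present.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    # After optional whitespace, an XML document must begin with "<".
    return head.lstrip().startswith(b"<")


# The eight bytes returned by raw_result.read(8) above.
print(looks_like_xml(b"BRTR\x00\x00\x00\x03"))
# The first line of the file downloaded through the browser.
print(looks_like_xml(b"<?xml version='1.0' encoding='UTF-8'?>"))
```

The first call prints False and the second prints True, which matches what the parser is complaining about: the response body is not XML.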
Fred Larson
A: 

Your server uses the Accept header to decide what to send back and in which format. The following should work:

In [265]: import urllib2

In [266]: from xml.dom import minidom

In [267]: req = urllib2.Request(query, headers={'Accept': 'application/xml'})

In [268]: rsp = urllib2.urlopen(req)

In [269]: doc = minidom.parse(rsp)

In [270]: doc.toxml()[:64]
Out[270]: u'<?xml version="1.0" ?><sparql xmlns="http://www.w3.org/2005/spar'

Note the Accept header passed to urllib2.Request.
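
For anyone on Python 3, where urllib2 was folded into urllib.request, the same idea might look roughly like the sketch below. The function names build_xml_request and fetch_xml and the example.org URL are placeholders for illustration, not part of the original session; fetch_xml performs a real network call and is not invoked here.

```python
import urllib.request
from xml.dom import minidom


def build_xml_request(url):
    """Build a request that asks the server for an XML response."""
    return urllib.request.Request(url, headers={"Accept": "application/xml"})


def fetch_xml(url):
    """Fetch url and parse the response body as XML (network call)."""
    with urllib.request.urlopen(build_xml_request(url)) as rsp:
        # minidom.parse accepts any file-like object with a read() method.
        return minidom.parse(rsp)


# Placeholder URL; substitute the real query URL from the question.
req = build_xml_request("http://example.org/sparql?query=SELECT")
print(req.get_header("Accept"))
```

Building the request separately from sending it makes it easy to inspect the headers before anything goes over the wire.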

ars
thanks, that works perfectly.
Jeff
A: 

It seems that the RDF server is delivering something other than XML to your urllib.urlopen call.

By setting the right header

Accept: application/sparql-results+xml, */*;q=0.5

you should be able to get the XML response. See the openRDF protocol specification for details; openRDF supports more than one result format.
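
Since the server can answer in more than one format, it may be worth checking the Content-Type of the response before handing it to an XML parser. The sketch below is a minimal illustration in Python 3; the helper name is_xml_media_type is hypothetical, and the header values passed to it are example strings rather than captured server output.

```python
def is_xml_media_type(content_type):
    """Return True if a Content-Type header value names an XML media type."""
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return (
        media_type in ("application/xml", "text/xml")
        or media_type.endswith("+xml")
    )


# Example header values (assumed for illustration).
print(is_xml_media_type("application/sparql-results+xml; charset=UTF-8"))
print(is_xml_media_type("application/octet-stream"))
```

With a real response object from urllib.request.urlopen, the value to test would typically come from rsp.headers.get("Content-Type").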

zovision