When I try to retrieve information from the Google weather API with the following URL,

http://www.google.com/ig/api?weather=Munich,Germany&hl=de

and then try to parse it with minidom, I get an error saying that the document is not well-formed.

I use the following code:

import urllib
from xml.dom import minidom
sock = urllib.urlopen(url)  # the URL mentioned above
doc = minidom.parse(sock)

I think the German characters in the response are the cause of the error.

What is the correct way of doing this?

+2  A: 

This seems to work:

sock = urllib.urlopen(url)
# There is a nicer way for this, but I don't remember right now:
encoding = sock.headers['Content-type'].split('charset=')[1]
data = sock.read()
dom = minidom.parseString(data.decode(encoding).encode('ascii', 'xmlcharrefreplace'))

I guess minidom doesn't handle anything non-ASCII. You might want to look into lxml instead, which does.
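As a rough, offline sketch of the approach above: the snippet below reads the charset through the standard library's `email.message.Message.get_content_charset()` instead of splitting the header by hand, then applies the same `xmlcharrefreplace` step. The helper names `charset_from_content_type` and `parse_as_ascii`, the UTF-8 fallback, and the sample XML bytes are assumptions made for illustration; they are not the real API response.

```python
from email.message import Message
from xml.dom import minidom


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` when the header carries no charset, which
    the split('charset=')[1] approach would not survive.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def parse_as_ascii(data, encoding):
    """Decode the raw bytes, then re-encode as pure ASCII.

    Non-ASCII characters become numeric character references
    (for example &#252;), so the parser never sees a raw non-ASCII byte.
    """
    ascii_bytes = data.decode(encoding).encode("ascii", "xmlcharrefreplace")
    return minidom.parseString(ascii_bytes)


# Hypothetical stand-in for the response body, encoded as ISO-8859-1.
sample = u'<weather><city data="M\xfcnchen"/></weather>'.encode("iso-8859-1")

encoding = charset_from_content_type("text/xml; charset=ISO-8859-1")
dom = parse_as_ascii(sample, encoding)
city = dom.getElementsByTagName("city")[0].getAttribute("data")
```

With a real response, the header value would come from the response object rather than a literal string.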

Lennart Regebro
Quote from http://evanjones.ca/python-utf8.html: "Minidom can handle any format of byte string, such as Latin-1 or UTF-16. However, it will only work reliably if the XML document has an encoding declaration (eg. <?xml version="1.0" encoding="Latin-1"?>). If the encoding declaration is missing, minidom assumes that it is UTF-8. It is a good habit to include an encoding declaration on all your XML documents, in order to guarantee compatibility on all systems."
ChristopheD
The lxml recommendation is a good one though...
ChristopheD
+1  A: 

The encoding sent in the headers is ISO-8859-1 according to Python's urllib.urlopen (although Firefox's Live HTTP Headers seems to disagree with me in this case and reports UTF-8). The XML itself does not declare an encoding, which is why xml.dom.minidom assumes it is UTF-8.

So the following should fix this specific issue:

import urllib
from xml.dom import minidom

sock = urllib.urlopen('http://www.google.com/ig/api?weather=Munich,Germany&hl=de')
s = sock.read()
encoding = sock.headers['Content-type'].split('charset=')[1] # iso-8859-1
doc = minidom.parseString(s.decode(encoding).encode('utf-8'))
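The effect can be reproduced without the network. The sketch below uses a made-up XML fragment (an assumption, not the actual API output) encoded as ISO-8859-1 and compares three cases: the bytes as received, the bytes transcoded to UTF-8 as in the code above, and the original bytes with an encoding declaration prepended. The `parses` helper is hypothetical.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical response body: ISO-8859-1 bytes with no XML declaration.
body = u'<weather><city data="M\xfcnchen"/></weather>'.encode("iso-8859-1")


def parses(data):
    """Return True if minidom accepts the byte string, False otherwise."""
    try:
        minidom.parseString(data)
        return True
    except ExpatError:
        return False


# No declaration, so the parser assumes UTF-8; the byte 0xFC is not
# valid UTF-8 and the document is rejected as not well-formed.
undeclared_ok = parses(body)

# Decode with the charset from the headers, then re-encode as UTF-8.
transcoded_ok = parses(body.decode("iso-8859-1").encode("utf-8"))

# Alternatively, leave the bytes alone and declare their encoding.
declared_ok = parses(b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body)
```

Transcoding is usually the simpler option when the server's headers are trusted, since it does not involve editing the document itself.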

Edit: I've updated this answer after Glenn Maynard's comment. I took the liberty of borrowing one line from Lennart Regebro's answer.

ChristopheD
`Content-Type: text/xml; charset=ISO-8859-1`
Glenn Maynard
Very strange. On Firefox 3.013 (Linux), the Live HTTP Headers plugin reports Content-Type: text/xml; charset=UTF-8. The headers on the urllib.urlopen handle report ISO-8859-1, though. I should probably update the code then.
ChristopheD
I selected this answer because I read it first and it worked! Both answers are great! Thanks.
rangalo