views:

243

answers:

2

Hi! I'm working on a simple crawler in Python. The aim is to create a sitemap.xml. (you can find the very alpha version here: http://code.google.com/p/sitemappy/) I noticed that if I generate the xml with URLs containing non HTML entities (such as &), the xml doesn't validate and it isn't accepted by Google Webmaster Tools. Is there a quick way to encode the querystring part of the URLs?

Thank you!

Matteo

+4  A: 

cgi.escape to the rescue:

cgi.escape(s[, quote])

Convert the characters '&', '<' and '>' in string s to HTML-safe sequences. Use this if you need to display text that might contain such characters in HTML. If the optional flag quote is true, the quotation mark character ('"') is also translated; this helps for inclusion in an HTML attribute value, as in . If the value to be quoted might include single- or double-quote characters, or both, consider using the quoteattr() function in the xml.sax.saxutils module instead.

Quick interactive check:

>>> import cgi
>>> cgi.escape('<&>')
'&lt;&amp;&gt;'
>>>
gimel
+2  A: 

Saxutils has an escaping function for XML entities:

>>> from xml.sax import saxutils
>>> saxutils.escape("&")
'&amp;'
Ben James