cgi.escape seems like one possible choice. Does it work well? Is there something that is considered better?

+6  A: 

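cgi.escape is fine (on Python 3 the equivalent is html.escape; cgi.escape was deprecated in 3.2 and removed in 3.8). By default it escapes: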

  • < to &lt;
  • > to &gt;
  • & to &amp;

That is enough for text placed between HTML tags (element content). Attribute values also need their quote characters escaped.
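A minimal sketch of those three replacements, assuming Python 3, where `html.escape` with `quote=False` stands in for `cgi.escape` and its default behaviour. The sample string is made up for illustration.

```python
import html

# quote=False mirrors cgi.escape's default: only <, > and & are replaced,
# quote characters are left alone.
sample = '<b>Tom & Jerry</b> said "hi"'  # hypothetical input
escaped = html.escape(sample, quote=False)
print(escaped)  # &lt;b&gt;Tom &amp; Jerry&lt;/b&gt; said "hi"
```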

EDIT: If you also want to escape non-ASCII characters, for inclusion in a document that uses a different encoding, as Craig says, just use:

data.encode('ascii', 'xmlcharrefreplace')

Don't forget to decode data to unicode first, using whatever encoding it was encoded in.

However, in my experience that kind of encoding is useless if you just work with unicode all the time from the start. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).

Example:

>>> import cgi
>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;'
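On Python 3 the same pipeline could look roughly like this. It is only a sketch: `html.escape` stands in for `cgi.escape`, and the Latin-1 input bytes are an assumption made for the sake of the example.

```python
import html

raw = b"<a>b\xe1</a>"          # assumed input: bytes encoded as Latin-1
text = raw.decode("latin-1")   # decode to str (unicode) first
escaped = html.escape(text, quote=False)

# Option 1: pure-ASCII output, non-ASCII chars become numeric references.
ascii_out = escaped.encode("ascii", "xmlcharrefreplace")
print(ascii_out)               # b'&lt;a&gt;b&#225;&lt;/a&gt;'

# Option 2: keep the characters and encode once, at the very end,
# to the encoding declared in the document header.
utf8_out = escaped.encode("utf-8")
print(utf8_out)                # b'&lt;a&gt;b\xc3\xa1&lt;/a&gt;'
```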

Also worth noting (thanks Greg) is the extra quote parameter cgi.escape takes. With it set to True, cgi.escape also escapes double quote chars (") so you can use the resulting value in an XML/HTML attribute.
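A short sketch of the quote parameter in use, again assuming Python 3 and `html.escape`. Note one difference: `html.escape` defaults to `quote=True` and also escapes single quotes, whereas `cgi.escape` left quotes alone unless asked. The value and the tag are hypothetical.

```python
import html

title = 'Measures 12"H x 17"W'  # hypothetical value containing double quotes

# quote=True additionally escapes " (and, in html.escape, also ').
safe = html.escape(title, quote=True)
tag = '<img alt="{}">'.format(safe)
print(tag)  # <img alt="Measures 12&quot;H x 17&quot;W">
```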

nosklo
The additional boolean parameter to cgi.escape should also be considered for escaping quotes when text is used in HTML attribute values.
Greg Hewgill
Just to be sure: If I run all untrusted data through the `cgi.escape` function, is that enough to protect against all (known) XSS attacks?
Tomas Sedovic
@Tomas Sedovic: Depends on where you'll put the text after running cgi.escape on it. If placed in root HTML context then yes, you're completely safe.
nosklo
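To illustrate why the context matters, here is a small sketch (Python 3, with `html.escape` standing in for `cgi.escape`; the input and the surrounding markup are hypothetical). A value with no special characters passes through escaping unchanged, which is harmless as element text but not inside a URL-bearing attribute.

```python
import html

# Hypothetical untrusted value containing none of <, >, & or quotes.
untrusted = "javascript:alert(1)"
escaped = html.escape(untrusted, quote=True)

# Escaping leaves the value untouched, so it is inert as element text...
as_text = "<p>{}</p>".format(escaped)
# ...but it is still a script URL if dropped into an href attribute.
as_href = '<a href="{}">click</a>'.format(escaped)
print(escaped == untrusted)  # True
```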
What about input like {{Measures 12 Ω"H x 17 5/8"W x 8 7/8"D. Imported.}} That's not ascii, so encode() will throw an exception at you.
Andrew Kolesnikov
@Andrew Kolesnikov: Have you tried it? `cgi.escape(yourunicodeobj).encode('ascii', 'xmlcharrefreplace') == '{{Measures 12 &#937;"H x 17 5/8"W x 8 7/8"D. Imported.}}'` -- as you can see, the expression returns an ascii bytestring, with all non-ascii unicode chars encoded as XML numeric character references.
nosklo
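The same check can be sketched on Python 3, with `html.escape` standing in for `cgi.escape`. The omega is written as `\u03a9` on the assumption that the character in the comment is the Greek capital letter.

```python
import html

# \u03a9 is GREEK CAPITAL LETTER OMEGA (assumed to be the character above).
text = '{{Measures 12 \u03a9"H x 17 5/8"W x 8 7/8"D. Imported.}}'

# No exception: xmlcharrefreplace turns each non-ASCII character into a
# numeric reference instead of raising UnicodeEncodeError.
result = html.escape(text, quote=False).encode("ascii", "xmlcharrefreplace")
print(result)  # b'{{Measures 12 &#937;"H x 17 5/8"W x 8 7/8"D. Imported.}}'
```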
+2  A: 

cgi.escape should be good to escape HTML in the limited sense of escaping the HTML tags and character entities.

But you might have to also consider encoding issues: If the HTML you want to quote has non-ASCII characters in a particular encoding, then you would also have to take care that you represent those sensibly when quoting. Perhaps you could convert them to entities. Otherwise you should ensure that the correct encoding translations are done between the "source" HTML and the page it's embedded in, to avoid corrupting the non-ASCII characters.
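A rough sketch of the encoding translation described above, assuming a Latin-1 source embedded in a UTF-8 page (both encodings are assumptions for the example, and `html.escape` is the Python 3 stand-in for `cgi.escape`). The last step shows the corruption that results when the translation is skipped.

```python
import html

source_bytes = "bá <b>".encode("latin-1")  # assumed: source HTML is Latin-1
text = source_bytes.decode("latin-1")      # decode with the *source* encoding
page_bytes = html.escape(text, quote=False).encode("utf-8")  # page is UTF-8
print(page_bytes)  # b'b\xc3\xa1 &lt;b&gt;'

# Skipping the translation corrupts the character: the Latin-1 byte is not
# valid UTF-8, so a UTF-8 reader ends up with a replacement character.
broken = source_bytes.decode("utf-8", errors="replace")
print("\ufffd" in broken)  # True
```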

Craig McQueen