ansaurus

Question

How do I perform HTML decoding/encoding using Python/Django?

Answer 1

A:

The question should be, why are your strings stored like that?

EDIT: I don't know why this is being downvoted. It's a valid question: you're adding to the file size and adding zero security if you're just going to re-encode them.

Andrew G. Johnson 2008-11-08 20:47:33

I didn't down vote you, I agree it's a valid question. I explained why I am doing it this way. I would do the security checks after I have them encoded as html, and I'm not worried about file size. So, I'd still like to find out how to encode them as html.

rksprst 2008-11-08 20:58:16

This is probably being down-voted because it's not an answer. It should be a comment on the question.

cdleary 2009-01-16 01:00:21

@cdleary - were comments available on November 8 2008?

Andrew G. Johnson 2010-06-14 10:45:18

Answer 2

+1 A:

I found this in the Cheetah source code:

htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlDecode(s, codes=htmlCodesReversed):
    """ Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s

I'm not sure why they reverse the list. I think it has to do with the way they encode, so in your case it may not need to be reversed. Also, if I were you I would change htmlCodes to be a list of tuples rather than a list of lists. This is going in my library though :)

I noticed your title asked for encoding too, so here is Cheetah's encode function.

def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
        display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[0], code[1])
    return s

Cipher 2008-11-08 20:58:52

The list is reversed because decode and encode replacements always have to be made symmetrically. Without the reversing you could, e.g., convert '&amp;lt;' to '&lt;', then in the next step incorrectly convert that to '<'.

bobince 2008-11-09 01:08:20
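A minimal sketch of the ordering problem described above, assuming Python 3. The names CODES, encode and decode are made up for this illustration and are not Cheetah's. It encodes the literal text '&lt;' and then decodes it twice, once in encode order and once in reversed order:

```python
# Encoding replaces '&' first, so decoding has to undo '&amp;' last.
CODES = (
    ('&', '&amp;'),
    ('<', '&lt;'),
    ('>', '&gt;'),
    ('"', '&quot;'),
)

def encode(s):
    # Apply the replacements in table order ('&' first).
    for char, entity in CODES:
        s = s.replace(char, entity)
    return s

def decode(s, codes):
    # Undo the replacements in whatever order `codes` is given.
    for char, entity in codes:
        s = s.replace(entity, char)
    return s

original = '&lt;'                                # literal text, not a tag
encoded = encode(original)                       # '&amp;lt;'
wrong = decode(encoded, CODES)                   # '&amp;' undone first, result '<'
right = decode(encoded, tuple(reversed(CODES)))  # '&amp;' undone last, result '&lt;'
```

Only the reversed table round-trips back to the original string.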

Answer 3

+10 A:

The Cheetah function should work, but is missing the single-quote. Use this tuple instead:

htmlCodes = (
    ('&', '&amp;'),
    ('<', '&lt;'),
    ('>', '&gt;'),
    ('"', '&quot;'),
    ("'", '&#39;'),
)

Here's Django's django.utils.html.escape function for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

I also think it would make more sense to store the HTML unescaped in your database. It'd be worth looking into getting unescaped results back from BeautifulSoup if possible.

In addition, escaping only occurs in Django during template rendering. So to prevent escaping you just tell the templating engine not to escape your string: Use either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %} in your templates.

Daniel 2008-11-08 21:40:37

Reminds me to never use Django or Cheetah.

Ali A 2008-11-08 23:06:38

Why not use Django or Cheetah?

Mat 2009-02-07 21:26:43

Is there no opposite of django.utils.html.escape?

Mat 2009-02-07 21:38:03

I think escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape - you just tell the templating engine not to escape. either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}

Daniel 2009-02-08 01:03:39

@Daniel: Please change your comment to an answer so that I can vote it up! |safe was exactly what I (and I'm sure others) was looking for in answer to this question.

Wayne Koorts 2009-06-23 07:12:48

Ok, I'll modify the answer

Daniel 2009-06-26 16:26:54

Should be '&#39;' instead of '&#39;/'.

ionut bizau 2009-09-23 06:57:40

Answer 4

A:

Try regular expressions (like preg_replace in php).

Kirill Titov 2008-11-08 21:56:16

Regular expressions are overkill for simple replacements. str_replace with two arrays passed in would be more sensible.

Stephen Caldwell 2008-11-08 21:58:37

Answer 5

+4 A:

Use Daniel's solution if the set of encoded characters is relatively restricted. Otherwise, use one of the numerous HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML:

http://www.crummy.com/software/BeautifulSoup/

For your question, there's an example in their documentation:

from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'

vincent 2008-11-09 01:15:21

BeautifulSoup doesn't convert hex entities (&#x65;) http://stackoverflow.com/questions/57708/convert-xml-html-entities-into-unicode-string-in-python/57745#57745

J.F. Sebastian 2009-03-17 20:46:14

Answer 6

+4 A:

See the bottom of this page on the Python wiki: there are at least two options for unescaping HTML.

zgoda 2008-11-23 13:50:40

Answer 7

+14 A:

For html encoding, there's cgi.escape from the standard library:

>>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.

For html decoding, I use the following:

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)

For anything more complicated, I use BeautifulSoup.

2009-01-16 01:12:53
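The cgi.escape and htmlentitydefs snippets above are Python 2 code. As a hedged sketch for readers on Python 3 (3.4 or later is assumed here), the standard library's html module covers both directions: html.escape for encoding and html.unescape for decoding. The sample string is made up for illustration:

```python
import html

text = '<a href="x">Tom & Jerry\'s</a>'

# quote=True is the default, so both " and ' are escaped as well.
encoded = html.escape(text)
# '&lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#x27;s&lt;/a&gt;'

# unescape handles named, decimal and hexadecimal character references.
decoded = html.unescape(encoded)
mixed = html.unescape('Sacr&eacute; bl&#101;u! &#x65;')
```

Because html.unescape also handles hexadecimal references, no manual patching of the entity table is needed for the apostrophe.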

+1: for `htmlentitydefs`

J.F. Sebastian 2009-03-17 20:37:45

Answer 8

+5 A:

Daniel's comment as an answer:

"escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape - you just tell the templating engine not to escape. either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}"

dfrankow 2009-10-24 22:04:16

Works, except that my version of Django does not have 'safe'. I use 'escape' instead. I assume it's the same thing.

willem 2009-12-28 11:23:58

Answer 9

+1 A:

I found a fine function at: http://snippets.dzone.com/posts/show/4569

def decodeHtmlentities(string):
    import re
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

    def substitute_entity(match):
        from htmlentitydefs import name2codepoint as n2cp
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)

            if cp:
                return unichr(cp)
            else:
                return match.group()

    return entity_re.subn(substitute_entity, string)[0]

slowkvant 2010-07-17 13:27:49

The benefit of using re is you can match both &#039; and &#39; using the same search.

Neal S. 2010-10-15 13:38:32
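For reference, a rough Python 3 adaptation of the regex approach above, not the original snippet. The names decode_html_entities and ENTITY_RE are hypothetical, and hexadecimal references such as &#x27; are handled as an extension the original does not have. Unknown or out-of-range entities are left untouched:

```python
import re
from html.entities import name2codepoint

# Three alternatives: decimal (&#39;), hexadecimal (&#x27;), named (&eacute;).
ENTITY_RE = re.compile(r"&(?:#(\d{1,7})|#[xX]([0-9a-fA-F]{1,6})|(\w{1,8}));")

def decode_html_entities(text):
    def substitute(match):
        dec, hexa, name = match.groups()
        try:
            if dec:
                return chr(int(dec))
            if hexa:
                return chr(int(hexa, 16))
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: keep the text as-is.
            return match.group()
        cp = name2codepoint.get(name)
        # Unknown names are returned unchanged.
        return chr(cp) if cp is not None else match.group()

    return ENTITY_RE.sub(substitute, text)
```

With this sketch, decode_html_entities("&#39;&#039;&#x27;") gives three apostrophes, which is the point made in the comment above about matching several spellings with one search.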

Answer 10

A:

slowkvant's solution was the only solution I could find that would unescape quotations but I don't have a high enough reputation to vote it up and apparently to comment on it. So I am posting this. Thanks slowkvant.

Derrick Petzold 2010-09-26 07:29:15
