ansaurus

Question

Converting HTML entities into their values in Python

Answer 1

A:

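Without knowing what the expression is being used for, I can't tell exactly what you need.

This pattern matches either a run of characters that are not ASCII letters, digits, @, or # (because of the *, the run may be empty), or a # followed by letters or digits and a closing semicolon: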

[^a-zA-Z0-9@#]*|#[0-9A-Za-z]+;
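To see how the two alternatives in the pattern above interact, here is a small check, written for Python 3. The sample strings and the reordered variant are my own illustrations, not part of the answer. In Python's re module the alternatives are tried left to right, so when matching at a '#' the first branch succeeds with an empty match before the second branch is tried.

```python
import re

# Pattern exactly as given in the answer.
pattern = re.compile(r'[^a-zA-Z0-9@#]*|#[0-9A-Za-z]+;')

# A run of punctuation and whitespace is consumed by the first alternative.
punct = pattern.match('?! abc').group()

# At a '#', the first alternative succeeds with an empty match,
# so re.match never tries the second alternative here.
at_hash = pattern.match('#227;').group()

# Hypothetical variant: entity branch first, and '+' instead of '*'.
variant = re.compile(r'#[0-9A-Za-z]+;|[^a-zA-Z0-9@#]+')
variant_at_hash = variant.match('#227;').group()

print(repr(punct))            # '?! '
print(repr(at_hash))          # ''
print(repr(variant_at_hash))  # '#227;'
```

Whether the variant is what you want depends on what the expression is used for, which the question does not make clear.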

Trey 2010-05-02 23:45:30

Answer 2

A:

You can adapt the following script:

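import htmlentitydefs
import re

def substitute_entity(match):
    # group(1) is the entity body: a name ("laquo") or "#" plus digits ("#123")
    name = match.group(1)
    if name in htmlentitydefs.name2codepoint:
        return unichr(htmlentitydefs.name2codepoint[name])
    elif name.startswith('#'):
        try:
            return unichr(int(name[1:]))
        except (ValueError, OverflowError):
            # not a decimal number, or not a code point unichr accepts
            pass

    # unknown entity: substitute a placeholder
    return '?'

print re.sub(r'&(#?\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')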

This produces the following output:

x « y ? z {
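The script above targets Python 2 (htmlentitydefs, unichr, the print statement). As a rough sketch of the same idea in Python 3, assuming the standard html and html.entities modules: the port below keeps the '?' placeholder, and html.unescape is shown alongside for comparison. The names ENTITY_RE, ported and builtin are only for illustration.

```python
import html
import re
from html.entities import name2codepoint

ENTITY_RE = re.compile(r'&(#?\w+);')

def substitute_entity(match):
    # group(1) is the entity body: a name ("laquo") or "#" plus digits ("#123")
    name = match.group(1)
    if name in name2codepoint:
        return chr(name2codepoint[name])
    if name.startswith('#'):
        try:
            return chr(int(name[1:]))
        except (ValueError, OverflowError):
            # not a decimal number, or not a valid code point
            pass
    # unknown entity: substitute a placeholder
    return '?'

text = 'x &laquo; y &wat; z &#123;'
ported = ENTITY_RE.sub(substitute_entity, text)
builtin = html.unescape(text)

print(ported)   # x « y ? z {
print(builtin)  # x « y &wat; z {
```

Note the difference: html.unescape leaves an unknown entity such as &wat; in place, while the ported function replaces it with '?'.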

EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)

doublep 2010-05-02 23:46:32

Answer 3

+2 A:

Given that your text appears to contain numeric character references rather than named entities, you can first convert the byte string containing them (each one an ampersand, a hash, decimal digits, and a semicolon) to unicode:

import re

# matches decimal character references such as &#227;
xed_re = re.compile(r'&#(\d+);')

def usub(m):
    return unichr(int(m.group(1)))

s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)

If your terminal emulator can display arbitrary Unicode glyphs, print u will then show:

ã, ن, ش
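For reference, here is a sketch of the same substitution for Python 3, where chr replaces unichr and str is already Unicode. The hexadecimal branch (&#xe3;) and the names NUMERIC_REF_RE and numeric_ref_to_char are my additions, not part of the code above.

```python
import re

# Decimal (&#227;) or hexadecimal (&#xe3;) character references.
# The hexadecimal branch is an addition, not part of the answer above.
NUMERIC_REF_RE = re.compile(r'&#(?:(\d+)|[xX]([0-9a-fA-F]+));')

def numeric_ref_to_char(m):
    decimal, hexadecimal = m.groups()
    code = int(decimal) if decimal is not None else int(hexadecimal, 16)
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # Out-of-range code point: leave the reference untouched.
        return m.group(0)

s = '&#227;, &#1606;, &#1588;'
u = NUMERIC_REF_RE.sub(numeric_ref_to_char, s)
print(u)  # ã, ن, ش
```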

In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ASCII letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want -- why not accented letters as well as ASCII ones, for example? -- but, if it is what you want, it will work.)

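If you do have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer. (Its entitydefs table gives replacement text in Latin-1, falling back to a numeric reference for characters outside Latin-1; the name2codepoint table used in that answer maps each name to a Unicode code point.)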

Alex Martelli 2010-05-03 00:03:35
