The XML specification lists a number of Unicode characters that are either illegal or "discouraged". Given a string, what is the best way to remove all of those characters from it?

Right now, my best bet is a regular expression, but it's a bit of a mouthful:

import re

illegal_xml_re = re.compile(u'[\x00-\x08\x0b-\x1f\x7f-\x84\x86-\x9f\ud800-\udfff\ufdd0-\ufddf\ufffe-\uffff]')

clean = illegal_xml_re.sub('', dirty)

(A narrow build of Python 2.5 doesn't even know about Unicode chars above 0xFFFF, so there's no need to filter those.)

My question is: is this the best/proper way to do this?
Is there a more efficient or standard way?
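As a quick sanity check, the BMP-only pattern can be exercised like this (a sketch in Python 3 syntax, where all strings are Unicode; the sample input is made up):

```python
import re

# Same character class as above: control chars, surrogates,
# and the BMP noncharacters.
illegal_xml_re = re.compile(
    '[\x00-\x08\x0b-\x1f\x7f-\x84\x86-\x9f'
    '\ud800-\udfff\ufdd0-\ufddf\ufffe-\uffff]')

dirty = 'ab\x00cd\x0bef\x7fgh'
clean = illegal_xml_re.sub('', dirty)
print(clean)  # → abcdefgh
```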

UPDATE: Based on kaizer.se's comment, a more correct regular expression would have to be constructed on the fly, like this:

import re
import sys

illegal_unichrs = [ (0x00, 0x08), (0x0B, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
                    (0xD800, 0xDFFF), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF),
                    (0x1FFFE, 0x1FFFF), (0x2FFFE, 0x2FFFF), (0x3FFFE, 0x3FFFF),
                    (0x4FFFE, 0x4FFFF), (0x5FFFE, 0x5FFFF), (0x6FFFE, 0x6FFFF),
                    (0x7FFFE, 0x7FFFF), (0x8FFFE, 0x8FFFF), (0x9FFFE, 0x9FFFF),
                    (0xAFFFE, 0xAFFFF), (0xBFFFE, 0xBFFFF), (0xCFFFE, 0xCFFFF),
                    (0xDFFFE, 0xDFFFF), (0xEFFFE, 0xEFFFF), (0xFFFFE, 0xFFFFF),
                    (0x10FFFE, 0x10FFFF) ]

illegal_ranges = ["%s-%s" % (unichr(low), unichr(high)) 
                  for (low, high) in illegal_unichrs 
                  if low < sys.maxunicode]

illegal_xml_re = re.compile(u'[%s]' % u''.join(illegal_ranges))
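For what it's worth, on Python 3 (where unichr is spelled chr and every build is "wide") the same construction might look like this; a sketch, not part of the original post:

```python
import re
import sys

illegal_unichrs = [(0x00, 0x08), (0x0B, 0x1F), (0x7F, 0x84),
                   (0x86, 0x9F), (0xD800, 0xDFFF), (0xFDD0, 0xFDDF)]
# the two noncharacters U+xFFFE/U+xFFFF at the end of every plane
illegal_unichrs += [(plane + 0xFFFE, plane + 0xFFFF)
                    for plane in range(0, 0x110000, 0x10000)]

illegal_ranges = ['%s-%s' % (chr(low), chr(high))
                  for low, high in illegal_unichrs
                  if low < sys.maxunicode]
illegal_xml_re = re.compile('[%s]' % ''.join(illegal_ranges))

clean = illegal_xml_re.sub('', 'ok\x00 text\U0001FFFE here')
```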

I really wish someone could point me to a C implementation of this, perhaps in one of the many Python XML libraries?

+1  A: 

You could also use unicode's translate method to delete the selected code points. However, the mapping is pretty big (2,128 code points), and that might make it much slower than just using a regex:

ranges = [(0, 8), (0xb, 0x1f), (0x7f, 0x84), (0x86, 0x9f),
          (0xd800, 0xdfff), (0xfdd0, 0xfddf), (0xfffe, 0xffff)]
# fromkeys creates the wanted (code point -> None) mapping
nukemap = dict.fromkeys(r for start, end in ranges for r in range(start, end + 1))
clean = dirty.translate(nukemap)
kaizer.se
After some testing, this seems to be much slower than a regexp, especially for large strings.
itsadok
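For anyone wanting to reproduce that comparison, a rough timeit sketch along these lines works (the sample string and iteration count are arbitrary; absolute numbers will vary by machine and Python version):

```python
import re
import timeit

ranges = [(0, 8), (0x0B, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
          (0xD800, 0xDFFF), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF)]

pattern = re.compile('[%s]' % ''.join(
    '%s-%s' % (chr(lo), chr(hi)) for lo, hi in ranges))
nukemap = dict.fromkeys(cp for lo, hi in ranges for cp in range(lo, hi + 1))

dirty = 'some text\x00 with a few\x0b illegal chars\x7f ' * 1000

# both approaches must agree before timing them
assert pattern.sub('', dirty) == dirty.translate(nukemap)

print('regex:    ', timeit.timeit(lambda: pattern.sub('', dirty), number=100))
print('translate:', timeit.timeit(lambda: dirty.translate(nukemap), number=100))
```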