The XML specification lists a number of Unicode characters that are either illegal or "discouraged". Given a string, what is the best way to remove all of those characters from it?

Right now, my best bet is a regular expression, but it's a bit of a mouthful:

import re

illegal_xml_re = re.compile(u'[\x00-\x08\x0b-\x1f\x7f-\x84\x86-\x9f\ud800-\udfff\ufdd0-\ufddf\ufffe-\uffff]')

clean = illegal_xml_re.sub('', dirty)

(A narrow build of Python 2.5 doesn't even know about Unicode chars above 0xFFFF, so there's no need to filter those.)

My question is: is this the best/proper way to do this?
Is there a more efficient or standard way?
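As a quick sanity check, the BMP-only pattern can be exercised like this (a sketch in Python 3 syntax, where all strings are Unicode; the sample input is made up):

```python
import re

# Same character class as above: control chars, surrogates,
# and the BMP noncharacters.
illegal_xml_re = re.compile(
    '[\x00-\x08\x0b-\x1f\x7f-\x84\x86-\x9f'
    '\ud800-\udfff\ufdd0-\ufddf\ufffe-\uffff]')

dirty = 'ab\x00cd\x0bef\x7fgh'
clean = illegal_xml_re.sub('', dirty)
print(clean)  # → abcdefgh
```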

UPDATE: Based on kaizer.se's comment, a more correct regular expression would have to be constructed on the fly, like this:

import re
import sys

illegal_unichrs = [ (0x00, 0x08), (0x0B, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
                    (0xD800, 0xDFFF), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF),
                    (0x1FFFE, 0x1FFFF), (0x2FFFE, 0x2FFFF), (0x3FFFE, 0x3FFFF),
                    (0x4FFFE, 0x4FFFF), (0x5FFFE, 0x5FFFF), (0x6FFFE, 0x6FFFF),
                    (0x7FFFE, 0x7FFFF), (0x8FFFE, 0x8FFFF), (0x9FFFE, 0x9FFFF),
                    (0xAFFFE, 0xAFFFF), (0xBFFFE, 0xBFFFF), (0xCFFFE, 0xCFFFF),
                    (0xDFFFE, 0xDFFFF), (0xEFFFE, 0xEFFFF), (0xFFFFE, 0xFFFFF),
                    (0x10FFFE, 0x10FFFF) ]

illegal_ranges = ["%s-%s" % (unichr(low), unichr(high)) 
                  for (low, high) in illegal_unichrs 
                  if low < sys.maxunicode]

illegal_xml_re = re.compile(u'[%s]' % u''.join(illegal_ranges))
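For what it's worth, on Python 3 (where unichr is spelled chr and every build is "wide") the same construction might look like this; a sketch, not part of the original post:

```python
import re
import sys

illegal_unichrs = [(0x00, 0x08), (0x0B, 0x1F), (0x7F, 0x84),
                   (0x86, 0x9F), (0xD800, 0xDFFF), (0xFDD0, 0xFDDF)]
# the two noncharacters U+xFFFE/U+xFFFF at the end of every plane
illegal_unichrs += [(plane + 0xFFFE, plane + 0xFFFF)
                    for plane in range(0, 0x110000, 0x10000)]

illegal_ranges = ['%s-%s' % (chr(low), chr(high))
                  for low, high in illegal_unichrs
                  if low < sys.maxunicode]
illegal_xml_re = re.compile('[%s]' % ''.join(illegal_ranges))

clean = illegal_xml_re.sub('', 'ok\x00 text\U0001FFFE here')
```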

I really wish someone could point me to a C implementation of this, perhaps in one of the many Python XML libraries?

+1  A: 

You could also use unicode's translate method to delete the selected code points. However, the mapping is pretty big (2,128 code points), and that might make it much slower than just using a regex:

ranges = [(0, 8), (0xb, 0x1f), (0x7f, 0x84), (0x86, 0x9f),
          (0xd800, 0xdfff), (0xfdd0, 0xfddf), (0xfffe, 0xffff)]
# fromkeys creates the wanted (code point -> None) mapping
nukemap = dict.fromkeys(r for start, end in ranges for r in range(start, end + 1))
clean = dirty.translate(nukemap)
kaizer.se
After some testing, this seems to be much slower than a regexp, especially for large strings.
itsadok
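For anyone wanting to reproduce that comparison, a rough timeit sketch along these lines works (the sample string and iteration count are arbitrary; absolute numbers will vary by machine and Python version):

```python
import re
import timeit

ranges = [(0, 8), (0x0B, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
          (0xD800, 0xDFFF), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF)]

pattern = re.compile('[%s]' % ''.join(
    '%s-%s' % (chr(lo), chr(hi)) for lo, hi in ranges))
nukemap = dict.fromkeys(cp for lo, hi in ranges for cp in range(lo, hi + 1))

dirty = 'some text\x00 with a few\x0b illegal chars\x7f ' * 1000

# both approaches must agree before timing them
assert pattern.sub('', dirty) == dirty.translate(nukemap)

print('regex:    ', timeit.timeit(lambda: pattern.sub('', dirty), number=100))
print('translate:', timeit.timeit(lambda: dirty.translate(nukemap), number=100))
```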