tags:

views:

136

answers:

1

I am working on producing an xml document from python. We are using the xml.dom package to create the xml document. We are having a problem where we want to produce the character φ which is a φ. However, when we put that string in a text node and call toxml() on it we get φ. Our current solution is to use saxutils.unescape() on the result of toxml() but this is not ideal because we will have to parse the xml twice.

Is there someway to get the dom package to recognize "φ" as an xml character?

+1  A: 

I think you need to use a Unicode string with \u03c6 in it, because the .data field of a text node is supposed (as far as I understand) to be "parsed" data, not including XML entities (whence the & when made back into XML). If you want to ensure that, on output, non-ascii characters are expressed as entities, you could do:

import codecs

def ent_replace(exc):
  if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
    s = []
    for c in exc.object[exc.start:exc.end]:
      s.append(u'&#x%4.4x;' % ord(c))
    return (''.join(s), exc.end)
  else:
    raise TypeError("can't handle %s" % exc.__name__)

codecs.register_error('ent_replace', ent_replace)

and use x.toxml().encode('ascii', 'ent_replace').

Alex Martelli