I’m trying to implement a SOAP webservice in Python 2.6 using the suds library. That is working well, but I’ve run into a problem when trying to parse the output with lxml.

Suds returns a suds.sax.text.Text object with the reply from the SOAP service. The suds.sax.text.Text class is a subclass of Python's built-in unicode type. In essence, it is comparable to this Python expression:

u'<?xml version="1.0" encoding="utf-8" ?><root><lotsofelements /></root>'

This is incongruous: if the XML declaration is correct, the contents are UTF-8 encoded and thus cannot be a Python unicode object (those are stored in an internal encoding such as UCS-4).

lxml will refuse to parse this, as documented, since there is no clear answer to what encoding it should be interpreted as.

As I see it, there are two ways out of this bind:

  1. Strip the <?xml> declaration, including the encoding.
  2. Convert the output from Suds into a bytestring, using the specified encoding.

Currently, the data I’m receiving from the webservice is within the ASCII range, so either way will work, but both feel very much like ugly hacks to me, and I’m not sure what would happen if I started receiving data that needs a wider range of Unicode characters.
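For illustration, the two options above could be sketched with nothing but the standard library. The helper names `strip_declaration` and `to_declared_bytes`, the regular expressions, and the sample reply are assumptions made up for this sketch; they are not part of suds or lxml.

```python
import re

# Matches a leading XML declaration such as <?xml version="1.0" encoding="utf-8" ?>
XML_DECL = re.compile(r'^\s*<\?xml\b[^>]*\?>\s*')
# Captures the encoding name given in the declaration, if there is one
ENCODING = re.compile(
    r'^\s*<\?xml\b[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def strip_declaration(text):
    """Option 1: drop the whole <?xml ...?> declaration from a unicode string."""
    return XML_DECL.sub(u'', text, count=1)


def to_declared_bytes(text, default='utf-8'):
    """Option 2: encode the unicode string with the encoding it declares."""
    match = ENCODING.match(text)
    encoding = match.group(1) if match else default
    return text.encode(encoding)


# Hypothetical reply containing a non-ASCII character (U+00F8)
reply = u'<?xml version="1.0" encoding="utf-8" ?><root><name>S\xf8ren</name></root>'
stripped = strip_declaration(reply)   # unicode, no declaration left
as_bytes = to_declared_bytes(reply)   # byte string, really UTF-8 encoded now
```

Either result no longer contradicts itself: `stripped` is a unicode string that makes no claim about its encoding, and `as_bytes` is a byte string whose declaration matches its actual encoding, so non-ASCII data should survive both routes.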

Any good ideas? I can’t imagine I’m the first one in this position…

+1  A: 

Hmm, I'm currently implementing my first Suds-based solution and parsing my responses with lxml without a problem, but I think this could be because I'm doing it in a pretty blunt and dumb way. Here's what my code looks like:

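from StringIO import StringIO  # Python 2 module
from urllib2 import URLError

from lxml import etree

# Excerpt from a method body, hence self and the bare returns.
try:
    result = self.client.service.ExportOwnersDetails(fAccess=self.access_id, fParams=params)
except URLError:
    # TODO: Log timeout here, handle
    return
# str() converts the suds Text (a unicode subclass) to a byte string;
# in Python 2 this only succeeds while the reply is pure ASCII.
response = str(result.fReturn)

if len(response) == 0 or response.find('<?xml ') == -1:
    # TODO: Log import error here, handle
    return
response = StringIO(response)
xml = etree.parse(response)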

Like I said, not very clever (and obviously I still have some logging to do), but that's my approach. The fAccess, fParams, fReturn nonsense is the naming convention at the third-party provider I'm integrating with.

Tom
Well, you could use `etree.fromstring(response)` instead of converting to a StringIO first (etree.parse() is for reading files, while etree.fromstring() happily accepts strings). But the StringIO conversion may be the reason you’re not seeing the same errors I do…
mikl
Duh, I knew I'd been away from lxml too long. fromstring() worked fine for me. Thanks for asking a question so you could clean up my code.
Tom
+1  A: 

You and lxml are correct; a valid XML document must be a stream of bytes encoded as declared in the <?xml ..... header (default: UTF-8).

I'd suggest a third option: leave it in unicode with an XML header that omits the encoding declaration but leaves the version in there (future-safe). That will keep lxml happy and avoid the overhead of you encoding it again.
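A minimal sketch of that third option, assuming a regular expression is good enough to find the declaration. The helper name `drop_encoding_declaration` and the sample reply are made up for illustration:

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
# Group 1 keeps everything before it, including version="...".
ENCODING_ATTR = re.compile(
    r'^(\s*<\?xml\b[^>]*?)\s+encoding\s*=\s*(?:"[^"]*"|\'[^\']*\')'
)


def drop_encoding_declaration(text):
    """Remove encoding="..." from the XML declaration and keep the rest."""
    return ENCODING_ATTR.sub(r'\1', text, count=1)


# Hypothetical reply containing a non-ASCII character (U+00F8)
reply = u'<?xml version="1.0" encoding="utf-8" ?><root><name>S\xf8ren</name></root>'
cleaned = drop_encoding_declaration(reply)
```

The cleaned unicode string can then be handed to `etree.fromstring(cleaned)`, since lxml only objects to unicode input that carries an encoding declaration.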

I'd also suggest some gentle enquiry at the suds site and having a poke around in their source.

John Machin
I suppose simply removing the encoding part is a reasonable way to go about it, thanks. I think I’ll hit up the suds guys to see whether a fix for this edge case is worthy of inclusion in the main library.
mikl