We have a process in which XML is delivered to us via ESMTP in an email body. The email's Content-Type header declares the body's character set as ISO-8859-1, but the XML declaration itself specifies no encoding. According to the XML specification, a document with no encoding declaration and no byte order mark defaults to UTF-8.

The problem is that our XML parser throws an exception when it encounters the ® character. It assumes it is parsing UTF-8, where ® is encoded as two bytes (0xC2 0xAE), not one (0xAE) as in ISO-8859-1.
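To make the byte-level mismatch concrete, here is a minimal sketch in Python (the variable names are only illustrative). It encodes ® both ways and shows that the single ISO-8859-1 byte is not valid UTF-8 on its own:

```python
# U+00AE (registered sign) encoded in each character set.
latin1_bytes = "®".encode("iso-8859-1")  # single byte 0xAE
utf8_bytes = "®".encode("utf-8")         # two bytes 0xC2 0xAE

# In UTF-8, 0xAE is a continuation byte. Without a preceding lead byte
# it is malformed, so a strict UTF-8 decoder rejects it.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

A conforming XML parser that has been told, or has assumed, that the input is UTF-8 must treat such a byte sequence as a fatal error, which matches the exception described above.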

  1. Should we assume that the body is ISO-8859-1 and thus override the XML default encoding (UTF-8)?
  2. More subjectively, is the email being sent incorrectly, and would it be better for us to try to interpret it as UTF-8 on our side or to ask the sender to specify the encoding correctly and consistently?

Here is a sample email body with XML:

Delivered-To: ...
Received: ...
Received: ...
Return-Path: ...
Received: ...
Received-SPF: ...
Authentication-Results: ...
Received: ...
Thread-Topic: ...
From: ...
To: ...
Subject: ...
Date: ...
Message-ID: ...
MIME-Version: 1.0
Content-Type: text/plain;
 charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
X-Mailer: Microsoft CDO for Windows 2000
Content-Class: urn:content-classes:message
Importance: normal
Priority: normal
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.4325

<?xml version="1.0"?>
...
   <comments>Super Widget®</comments>
...
A:

The XML specification says in appendix F, concerning encoding detection:

Also, in many cases other sources of information are available in addition to the XML data stream itself.

So yes: in the absence of an encoding="..." declaration in the XML stream itself, you should rely on the external source, which in this case is the charset parameter of the Content-Type header.
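One way to apply this, sketched in Python with only the standard library: decode the body using the charset from the Content-Type header, then hand the resulting text to the XML parser. The function name parse_xml_body and the fallback to UTF-8 when no charset is present are assumptions for illustration, not part of the original setup.

```python
import email
import xml.etree.ElementTree as ET
from email import policy


def parse_xml_body(raw_message: bytes) -> ET.Element:
    """Parse the XML in a single-part email body (hypothetical helper)."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)

    # charset parameter of Content-Type; fall back to the XML default
    # (assumption: UTF-8 is the right fallback for this feed).
    charset = msg.get_content_charset() or "utf-8"

    # decode=True undoes the Content-Transfer-Encoding and returns bytes.
    body_bytes = msg.get_payload(decode=True)

    # Decode with the externally declared charset, then parse as text.
    return ET.fromstring(body_bytes.decode(charset))
```

Because the parser receives already-decoded text, the missing encoding declaration in the XML no longer matters. This only covers a non-multipart message like the sample; a multipart message would need the relevant part selected first.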

Roland Illig