We have a process in which XML is delivered to us via ESMTP in an email body. The email's Content-Type header specifies the character set of the body as ISO-8859-1, but the XML declaration specifies no encoding. According to the XML specification, a document without an encoding declaration (and without a byte order mark) defaults to UTF-8.
The problem is that our XML parser throws an exception when it encounters the ® character, because it assumes it is parsing UTF-8, and ® in UTF-8 is two bytes (0xC2 0xAE), not one (0xAE) as in ISO-8859-1.
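To make the byte-level mismatch concrete, here is a minimal Python sketch. The helper name `decodes_as_utf8` is made up for illustration and is not part of our actual process; it only shows that the single byte ISO-8859-1 uses for ® is not valid UTF-8 on its own.

```python
# The registered-trademark sign is code point U+00AE.
reg = "\u00ae"

latin1_bytes = reg.encode("iso-8859-1")  # one byte: 0xAE
utf8_bytes = reg.encode("utf-8")         # two bytes: 0xC2 0xAE


def decodes_as_utf8(data: bytes) -> bool:
    """Hypothetical helper: report whether the bytes are well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The ISO-8859-1 form of the sample text is rejected by a strict UTF-8 decoder,
# which matches the kind of failure a UTF-8-assuming XML parser would hit.
sample = "Super Widget\u00ae".encode("iso-8859-1")
print(latin1_bytes.hex(), utf8_bytes.hex(), decodes_as_utf8(sample))
```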
- Should we assume that the body is ISO-8859-1, as the email header states, and thus override the XML default encoding (UTF-8)?
- More subjectively, is the email being sent incorrectly, and would it be better for us to try to interpret it as UTF-8 on our side, or to ask whoever is sending it to specify the encoding correctly and consistently?
Here is a sample email (headers and body) containing the XML:
Delivered-To: ...
Received: ...
Received: ...
Return-Path: ...
Received: ...
Received-SPF: ...
Authentication-Results: ...
Received: ...
Thread-Topic: ...
From: ...
To: ...
Subject: ...
Date: ...
Message-ID: ...
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 8bit
X-Mailer: Microsoft CDO for Windows 2000
Content-Class: urn:content-classes:message
Importance: normal
Priority: normal
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.3790.4325

<?xml version="1.0"?>
...
<comments>Super Widget®</comments>
...
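For reference, the first option (trusting the MIME charset and overriding the XML default) could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the message is a cut-down stand-in for the sample above, the `<order>` root element is hypothetical because the real XML is elided, and our actual parser and language may differ.

```python
import email
import xml.etree.ElementTree as ET

# Cut-down stand-in for the sample message; <order> is a hypothetical root element.
raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain;\r\n"
    b"\tcharset=\"iso-8859-1\"\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"<?xml version=\"1.0\"?>\r\n"
    b"<order><comments>Super Widget\xae</comments></order>\r\n"
)

msg = email.message_from_bytes(raw)

# Charset declared by the transport layer; fall back to the XML default if absent.
charset = msg.get_content_charset() or "utf-8"

# Undo the Content-Transfer-Encoding, then decode using the MIME charset
# rather than letting the XML parser guess from the raw bytes.
payload = msg.get_payload(decode=True)
text = payload.decode(charset)

# Parsing an already-decoded string sidesteps the parser's UTF-8 assumption.
root = ET.fromstring(text)
comment = root.findtext("comments")
print(charset, comment)
```

Passing the raw `payload` bytes straight to the parser instead would be expected to fail in the way described above, since the parser would then assume UTF-8. Note that this approach only works as long as the XML declaration carries no conflicting `encoding` attribute.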