views:

580

answers:

2

I'm trying to load a simple Xml file (encoded in UTF-8):

<?xml version="1.0" encoding="UTF-8"?>
<Test/>

And save it with MSXML in vbscript:

Set xmlDoc = CreateObject("MSXML2.DOMDocument.6.0")

xmlDoc.Load("C:\test.xml")

xmlDoc.Save "C:\test.xml" 

The problem is, MSXML saves file in ANSI instead of UTF-8 (despite the original file being encoded in UTF-8).

The MSDN docs for MSXML says that save() will write the file in whatever encoding the XML is defined in:

Character encoding is based on the encoding attribute in the XML declaration, such as . When no encoding attribute is specified, the default setting is UTF-8.

But this is clearly not working at least on my machine.

How can MSXML save in UTF-8?

+1  A: 

There isn't any non-ANSI text in your XML file, so it will be identical whether UTF-8 or ASCII encoded. In my tests, after adding non-ASCII text to test.xml, MSXML always saves in UTF-8 encoding and also writes the BOM if there was one to begin with.

http://en.wikipedia.org/wiki/UTF-8
http://en.wikipedia.org/wiki/Byte_order_mark

Kyle Alons
And so I guess there's no way MSXML can save in UTF-8 if there's no unicode bytes in the file?
stung
By definition, there is no difference between an ASCII and UTF-8 file if it contains only ASCII characters (except for the BOM if included)...
Kyle Alons
A: 

You use two other classes in MSXML to write out XML properly encoded to an output stream.

Here is my helper method that writes to a generic IStream:

class procedure TXMLHelper.WriteDocumentToStream(const Document60: IXMLDOMDocument2; const stream: IStream; Encoding: string = 'UTF-8');
var
    writer: IMXWriter;
    reader: IVBSAXXMLReader;
begin
{
    From http://support.microsoft.com/kb/275883
    INFO: XML Encoding and DOM Interface Methods

    MSXML has native support for the following encodings:
        UTF-8
        UTF-16
        UCS-2
        UCS-4
        ISO-10646-UCS-2
        UNICODE-1-1-UTF-8
        UNICODE-2-0-UTF-16
        UNICODE-2-0-UTF-8

    It also recognizes (internally using the WideCharToMultibyte API function for mappings) the following encodings:
        US-ASCII
        ISO-8859-1
        ISO-8859-2
        ISO-8859-3
        ISO-8859-4
        ISO-8859-5
        ISO-8859-6
        ISO-8859-7
        ISO-8859-8
        ISO-8859-9
        WINDOWS-1250
        WINDOWS-1251
        WINDOWS-1252
        WINDOWS-1253
        WINDOWS-1254
        WINDOWS-1255
        WINDOWS-1256
        WINDOWS-1257
        WINDOWS-1258
}

    if Document60 = nil then
        raise Exception.Create('TXMLHelper.WriteDocument: Document60 cannot be nil');
    if stream = nil then
        raise Exception.Create('TXMLHelper.WriteDocument: stream cannot be nil');

    // Set properties on the XML writer - including BOM, XML declaration and encoding
    writer := CoMXXMLWriter60.Create;
    writer.byteOrderMark := True; //Determines whether to write the Byte Order Mark (BOM). The byteOrderMark property has no effect for BSTR or DOM output. (Default True)
    writer.omitXMLDeclaration := False; //Forces the IMXWriter to skip the XML declaration. Useful for creating document fragments. (Default False)
    writer.encoding := Encoding; //Sets and gets encoding for the output. (Default "UTF-16")
    writer.indent := True; //Sets whether to indent output. (Default False)
    writer.standalone := True;

    // Set the XML writer to the SAX content handler.
    reader := CoSAXXMLReader60.Create;
    reader.contentHandler := writer as IVBSAXContentHandler;
    reader.dtdHandler := writer as IVBSAXDTDHandler;
    reader.errorHandler := writer as IVBSAXErrorHandler;
    reader.putProperty('http://xml.org/sax/properties/lexical-handler', writer);
    reader.putProperty('http://xml.org/sax/properties/declaration-handler', writer);


    writer.output := stream; //The resulting document will be written into the provided IStream

    // Now pass the DOM through the SAX handler, and it will call the writer
    reader.parse(Document60);

    writer.flush;
end;

In order to save to a file i call the Stream version with a FileStream:

class procedure TXMLHelper.WriteDocumentToFile(const Document60: IXMLDOMDocument2; const filename: string; Encoding: string='UTF-8');
var
    fs: TFileStream;
begin
    fs := TFileStream.Create(filename, fmCreate or fmShareDenyWrite);
    try
        TXMLHelper.WriteDocumentToStream(Document60, fs, Encoding);
    finally
        fs.Free;
    end;
end;

You can convert the functions to whatever language you like. These are Delphi.

Ian Boyd