I'm trying to parse an XML document using MSXML v4 in C++, with my own entity resolver that redirects the parser to local DTDs on my hard drive rather than letting it go online to fetch the DTDs from the locations specified in the XML file being parsed. I've managed to get this working with Xerces, but the behaviour I'm seeing in MSXML seems somewhat bizarre.

The XML document I'm reading is completely valid, so I don't expect the parser to report any errors. Indeed, that is the case if I let the parser go online to get the DTD files by leaving the pvarInput VARIANT pointer NULL in the resolveEntity(...) callback. However, as soon as I supply the parser with the text of the identical DTDs/MODs from my local disk, I get the error "incorrect document syntax", which is apparently reported against the first line of the XML file. This doesn't happen for all of the DTD files, though: in the case I'm trying to debug, the first DTD the parser asks for works with no problems, but I get the error as soon as it tries to use the second MOD file it asks for.

I'm new to COM, so it's quite possible that I'm doing something stupid and that's why this isn't working. Essentially, what I'm doing (using a simplified pseudo String class) is this:

HRESULT __stdcall resolveEntity(unsigned short* pwchPublicId, unsigned short* pwchSystemId, VARIANT* pvarInput)
{
    // Get the file name without its path
    String systemId = pwchSystemId;
    const int idx = systemId.FindLastChar(L'/');

    String fileName = systemId;
    if (idx > -1) {
        fileName = systemId.SubString(idx + 1);
    }

    // All the DTDs/MODs are UTF-8 encoded, so load the file into memory and convert it to a Unicode string
    String fileContent = LoadFileAsUTF8ConvertToUnicode(fileName);

    CComBSTR data(fileContent);
    data.CopyTo(pvarInput);
    data.Detach(); // Unsure of ownership semantics, so this might not be necessary

    return S_OK;
}

The fact that this works with some files but not with others is particularly baffling. I've made sure that the "fileContent" variable contains valid Unicode content for all the affected files, so it's nothing to do with any bugs that might exist in my UTF-8 conversion code. It's definitely something in the three lines that build the BSTR and hand it to pvarInput that MSXML is taking offence to, but I can't work out what it is!
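
For anyone who wants to reproduce the path-stripping step without my pseudo String class, here is a minimal sketch using std::wstring. FileNameFromSystemId is a hypothetical helper name (it is not part of MSXML), and it assumes the system IDs use forward slashes, as URLs do.

```cpp
#include <string>

// Hypothetical helper: returns the part of a system ID after the last '/'.
// If the system ID contains no '/', it is returned unchanged.
std::wstring FileNameFromSystemId(const std::wstring& systemId)
{
    const std::wstring::size_type idx = systemId.rfind(L'/');
    if (idx == std::wstring::npos) {
        return systemId; // no path component to strip
    }
    return systemId.substr(idx + 1);
}
```

This only covers the first few lines of the callback; the BSTR and VARIANT handling is the part I'm unsure about and is not shown here.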

Any help at all in respect of resolving entities in MSXML would be greatly appreciated. I can find very little about the subject at all online.