I'm trying to apply a stylesheet to an XML document using Saxon. The XML file was generated in Microsoft Word and contains Word-style curly quotes, such as those around FOO in the following document:

<?xml version="1.0" encoding="UTF-8"?>
<doc>
    <act>
     <performer typeCode=“FOO“ />
     <performer typeCode="BAR" />
    </act>
</doc>

Saxon throws the following error:

SXXP0003: Error reported by XML parser: Invalid byte 1 of 1-byte UTF-8 sequence.

What is the best way to handle these kinds of "special" characters, which were intended to be valid XML but break during parsing or transformation?

+1  A: 

Since the above is not well-formed XML, you will have to preprocess the input (say, with a FilterReader). Just about any XML parser will report an error here, and typically a fatal one, so you cannot handle the error and continue.

If the special quotes occur only in the XML markup, you can simply replace them with plain quotes (it is a little more work if you have to check the XML declaration for the encoding). If you want to keep special quotes elsewhere in the document, you will have to do something a bit more complicated, mostly keeping track of whether you are inside a tag or not.
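As a rough illustration of the "keep track of whether you are in a tag" idea, here is a minimal sketch of such a FilterReader. The class name `QuoteFixingReader` and the helper `fixAll` are made up for this example, and the tag tracking is deliberately naive: it assumes no `>` appears inside attribute values, comments, or CDATA sections. It also assumes the caller has already decoded the bytes with the right charset; the "Invalid byte 1 of 1-byte UTF-8 sequence" message suggests the file may not really be UTF-8 despite its declaration, but that is a guess.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: turns curly double quotes into '"' only inside tags. */
class QuoteFixingReader extends FilterReader {
    // True while we are between '<' and the next '>' (naive tag tracking).
    private boolean inTag = false;

    QuoteFixingReader(Reader in) {
        super(in);
    }

    // Updates the tag state and maps curly quotes to a plain quote inside tags.
    private int fix(int c) {
        if (c == '<') {
            inTag = true;
        } else if (c == '>') {
            inTag = false;
        } else if (inTag && (c == '\u201C' || c == '\u201D')) {
            return '"';
        }
        return c;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == -1 ? -1 : fix(c);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, in which case the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = (char) fix(buf[i]);
        }
        return n;
    }

    /** Convenience wrapper for small inputs: filters a whole string. */
    static String fixAll(String xml) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new QuoteFixingReader(new StringReader(xml))) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In practice the reader would wrap the file's own Reader and be handed to the parser as its input, so that Saxon only ever sees the cleaned character stream. Curly quotes in text content pass through unchanged.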

Kathy Van Stone
A: 

The trouble is that those 'special' quotes are not valid XML. Saxon, or any other XML parser, is going to reject them and refuse to parse the document.

The only thing I can suggest is a search and replace that swaps them for the expected plain quotes.
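For what it's worth, a blanket search and replace might look something like the sketch below. The class and method names are invented for illustration, and it assumes the document text has already been read into a String with the correct encoding. Note that it replaces every curly double quote, including any in text content.

```java
/** Hypothetical helper: replaces all curly double quotes with plain ones. */
class QuoteReplace {
    static String normalize(String xml) {
        return xml.replace('\u201C', '"')   // left double quotation mark
                  .replace('\u201D', '"');  // right double quotation mark
    }
}
```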

Gareth Davis