Saxon and character encoding: experiences and errors

Recently I ran some tests on XSLT transformations with Saxon. My main focus was file encoding and character sets, but I was also interested in the impact of different Saxon versions and of the x86 vs. x64 Java VM. The insights are not spectacular, but I'd still like to share them and ask for comments.

On XML file encoding: in general, you have to distinguish between the encoding declared in the XML declaration, such as <?xml version="1.0" encoding="ISO-8859-1"?>, and the actual encoding of the file. Of course, they should match. To determine or change the actual encoding, I found the editor Notepad++ (http://notepad-plus-plus.org) very handy (menu Format).
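For those who prefer to check this programmatically, here is a minimal sketch in Java. It reads the declared encoding from the XML declaration and then tests whether the file's bytes decode without errors under that charset. The class and method names (EncodingCheck, declaredEncoding, decodesCleanly) are made up for illustration, and the code assumes Java 7 or later and a file in an ASCII-compatible encoding (a UTF-16 file would not match the pattern).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Saxon or the JDK.
class EncodingCheck {

    // Matches the encoding pseudo-attribute of an XML declaration.
    private static final Pattern DECLARED =
            Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    /** Returns the declared encoding, or "UTF-8" (the XML default) if none is declared. */
    public static String declaredEncoding(byte[] data) {
        // ISO-8859-1 maps every byte to one char, so this decode cannot fail.
        int len = Math.min(data.length, 200);
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(head);
        return m.find() ? m.group(1) : "UTF-8";
    }

    /** Returns true if the bytes form a valid byte sequence in the given charset. */
    public static boolean decodesCleanly(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java EncodingCheck <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        String declared = declaredEncoding(data);
        System.out.println("declared: " + declared
                + ", decodes cleanly: " + decodesCleanly(data, declared));
    }
}
```

Note that this check is only meaningful for declarations like 'UTF-8'. Every byte sequence is valid ISO-8859-1, so a UTF-8 file declared as 'ISO-8859-1' will pass the check even though its non-ASCII characters will be misread.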

XSLT transformations worked as expected when the actual encoding of the stylesheet (xsl) file matched the one in its XML declaration. That is, stylesheets saved as UTF-8 with declaration 'UTF-8' were processed successfully, as were those saved as ANSI with declaration 'ISO-8859-1'.

However, a stylesheet saved as ANSI with declaration 'UTF-8' causes this error:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence. Failed to compile stylesheet. 1 error detected.
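This kind of failure can be reproduced without Saxon, since it comes from the XML parser. The following is a rough sketch, assuming the JDK's default JAXP parser; the class name MismatchDemo and the method parses are made up for illustration. It parses the same small document twice: once encoded as ISO-8859-1 and once as UTF-8, both times with declaration 'UTF-8'.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class, not part of Saxon.
class MismatchDemo {

    /** Returns true if the default JAXP parser accepts the given bytes as XML. */
    public static boolean parses(byte[] xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return true;
        } catch (Exception e) {
            // The exact exception class and message depend on the parser in use.
            System.out.println(e.getClass().getName() + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a>\u00e4</a>";
        // Declaration says UTF-8, but the bytes are ISO-8859-1: 0xE4 is not valid UTF-8 here.
        System.out.println(parses(doc.getBytes(StandardCharsets.ISO_8859_1)));
        // Declaration and actual encoding agree.
        System.out.println(parses(doc.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The mismatch only shows up when the file contains non-ASCII characters. A pure ASCII file has the same bytes in ANSI and UTF-8, so it parses fine under either declaration.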

And a stylesheet saved as UCS2 with declaration 'UTF-8' causes this error:

Error on line 1 column 40 of file:/C:/Temp/test.xsl: SXXP0003: Error reported by XML parser: Content is not allowed in prolog. Failed to compile stylesheet. 1 error detected.
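Besides using an editor, one way to fix such a file is to re-encode it so that the bytes match the declaration. Here is a minimal sketch in Java, assuming Java 7 or later; the class name Transcode and its command-line arguments are made up for illustration. It decodes the file with its actual charset and writes it back as UTF-8. The XML declaration itself is not touched, so it must already say 'UTF-8'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Saxon or the JDK.
class Transcode {

    /** Decodes the bytes with the 'from' charset and re-encodes them with the 'to' charset. */
    public static byte[] transcode(byte[] data, Charset from, Charset to) {
        return new String(data, from).getBytes(to);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: java Transcode <actual-charset> <in-file> <out-file>");
            return;
        }
        // Actual charset of the input, e.g. ISO-8859-1, windows-1252 or UTF-16.
        Charset actual = Charset.forName(args[0]);
        byte[] in = Files.readAllBytes(Paths.get(args[1]));
        Files.write(Paths.get(args[2]), transcode(in, actual, StandardCharsets.UTF_8));
    }
}
```

For example, java Transcode ISO-8859-1 test.xsl test-utf8.xsl would turn the single byte 0xE4 (ä in ISO-8859-1) into the two UTF-8 bytes 0xC3 0xA4.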

In the input XML, too, the encoding declaration should match its actual encoding. For me, however, mismatches there did not stop Saxon from processing; they only affected how characters appear in the generated output XML.

I ran all tests under Windows 7 x64 with both the x64 VM and the x86 VM (Sun JRE 1.6u22) and saw no difference. I also tried three Saxon versions and found no difference between them: saxonb8-8j, saxonb9-1-0-8j and saxonhe9-2-1-2j.
