I am having trouble transforming particular characters from an XML feed into XHTML.

I am using the following example to demonstrate the problem.

Here is my XML file:

<?xml version="1.0" encoding="UTF-8"?>
<paragraph>some text including the –, ã and ’ characters</paragraph>

Here is the XSLT I am applying:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" 
            encoding="UTF-8" 
            indent="yes"
            doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
            doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" />
    <xsl:template match="paragraph">
        <html xmlns="http://www.w3.org/1999/xhtml">
            <head></head>
            <body>
                <p><xsl:apply-templates/></p>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>

Here is the resultant XHTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
    <head></head>
    <body>
    <p>some text including the â€“, Ã£ and â€™ characters</p>
    </body>
</html>

The characters from the original XML are being replaced with new ones.

First, I want to check whether something is wrong with my encoding that would cause this issue.

Am I supposed to use entities to map the special characters so that they display correctly in XHTML? If so, how do I use them within an XSLT, and do I need to know in advance every possible value that could appear in my XML feed?

A: 

It may sound stupid, but are you sure the XML file is actually UTF-8? It's one thing to declare it in the prolog, but the file itself could be saved in another encoding.
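
One way to check this, as a rough sketch in Python rather than anything specific to XMLSpy or your XSLT processor: read the raw bytes and see whether they decode as strict UTF-8. The function name `is_valid_utf8` is made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict mode raises on invalid sequences
        return True
    except UnicodeDecodeError:
        return False


# The same character saved under two different encodings.
saved_as_utf8 = "ã".encode("utf-8")     # two bytes: C3 A3
saved_as_cp1252 = "ã".encode("cp1252")  # one byte: E3

print(is_valid_utf8(saved_as_utf8))    # True
print(is_valid_utf8(saved_as_cp1252))  # False
```

For a real file you would pass in the bytes read in binary mode, e.g. `open("feed.xml", "rb").read()` (the file name is a placeholder). Note that a pass only shows the bytes are valid UTF-8; a pure ASCII file passes too. A failure, though, does tell you the prolog is not telling the truth.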

Julian Aubourg
I was using XMLSpy to create the file and I believe that uses UTF-8 as standard. I've even re-created it in Notepad saving specifically as UTF-8 to make sure.
tentonipete
And the output file? Maybe the XSLT tool you are using is at fault here.
Julian Aubourg
+4  A: 

I agree with kdgregory: the output file looks to be in UTF-8, but whatever is reading it thinks it is in something else (ISO-8859-1 or CP-1252, the Windows default). Try adding a content type directly in the HTML head element:

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>

and see if that helps.
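
If you want to convince yourself that this is a reader-side mismatch and not a broken transform, here is a small sketch in Python (not part of your XSLT setup; the variable names are just for illustration). It encodes your sample text as UTF-8 and then decodes the same bytes as CP-1252, which is what a reader guessing the wrong charset effectively does.

```python
original = "some text including the –, ã and ’ characters"

utf8_bytes = original.encode("utf-8")  # the bytes a UTF-8 output file contains
misread = utf8_bytes.decode("cp1252")  # the same bytes read as CP-1252

print(misread)
# some text including the â€“, Ã£ and â€™ characters

# The bytes were never damaged: undoing the wrong decode recovers the text.
recovered = misread.encode("cp1252").decode("utf-8")
print(recovered == original)  # True
```

Since the bytes on disk are fine, you should not need entities or a list of every possible character in the feed; the reader just has to be told the charset.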

Kathy Van Stone
This makes the file render correctly in the browser, thanks. This would also explain why it displayed correctly in some browsers but not others.
tentonipete