I have the following DOM

    <row>
        <link href="B&#252;ro.txt" target="_blank">
            my link
        </link>
    </row>

When I serialize it to a file using the Java XMLSerializer, it comes out like this:

    <row>
        <link href="B&amp;#252;ro.txt" target="_blank">
            my link
        </link>
    </row>

Is there any way to control how XMLSerializer handles escaping in attributes? Should I be doing this differently anyway?

Update

I should also say that I am using JRE 1.6. I had been using JRE 1.5 until recently, and I am pretty sure that it was serialized 'correctly' (i.e. the '&' was not escaped).

Clarification

The DOM is created programmatically. Here is an example:

        Document doc = createDocument();
        Element root = doc.createElement("root");
        doc.appendChild(root);
        root.setAttribute("test1", "&#234;");
        root.setAttribute("test2", "üöä");
        root.appendChild(doc.createTextNode("&#234;"));

        StringWriter sw = new StringWriter();

        serializeDocument(doc, sw);
        System.out.println(sw.toString());

My solution

I didn't really want to do this, because it involved a fair amount of code change and testing, but I decided to move the data out of the attribute and into a CDATA section. Problem avoided rather than solved.

+2  A: 

How do you obtain the DOM? Could it have something to do with that? I tried your sample XML with the standard DocumentBuilder (just because I'm more familiar with it) using Sun Java 6 and the latest Xerces-J (2.9.1), which, by the way, deprecates XMLSerializer in favor of LSSerializer or TrAX.

Anyway, using this technique, the serialized document does not even contain the character reference anymore and gets converted to "Büro.txt". I used the following code:

    String xml = "<row>\n"
        + "        <link href=\"B&#252;ro.txt\" target=\"_blank\">\n"
        + "            my link\n" + "        </link>\n" + "    </row>";

    InputStream is = new ByteArrayInputStream(xml.getBytes());
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(is);

    XMLSerializer xs = new XMLSerializer();
    xs.setOutputCharStream(new PrintWriter(System.err));

    xs.serialize(doc);
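
For comparison, here is a minimal sketch of the same round trip using LSSerializer, one of the replacements mentioned above. It relies only on the DOM Level 3 Load and Save API bundled with the JDK; the class name `LsSerializerSketch` and its helper methods are made up for illustration. It is meant to show that the parser has already resolved `&#252;` into the character `ü` by the time the DOM exists, so there is no character reference left for the serializer to escape.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper class, not part of the original answer.
class LsSerializerSketch {

    // Parses an XML string into a DOM; the parser resolves character references.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializes the DOM to a String using the DOM Level 3 LSSerializer.
    static String serialize(Document doc) {
        DOMImplementationLS ls =
            (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();
        return serializer.writeToString(doc);
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<row><link href=\"B&#252;ro.txt\" target=\"_blank\">my link</link></row>");
        System.out.println(serialize(doc));
    }
}
```

Whether the serializer later writes the character raw or as a reference depends on the output encoding, not on how the input was spelled.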
musiKk
Thanks, +1. The DOM is created programmatically (appendChild etc.). I'll add a clarification to the question. I just discovered LSSerializer, so I'll look into that.
paul
Okay, let's see. Maybe someone else knows a better solution, but I suspect it's impossible (at least in a clean way) to create character references that way, because the value is treated as plain data and not as XML markup. I could be wrong, though. Since both XML and Java are Unicode-aware, this might not be too bad.
musiKk
+1  A: 

The problem is that you are building the DOM with attribute values that have already been "escaped" according to the XML conventions. The DOM (of course) doesn't realize that you have done this and is escaping the ampersand.

You should change

    root.setAttribute("test1", "&#234;");

to

    root.setAttribute("test1", "\u00EA");

In other words, use strings consisting of plain Unicode code points when constructing the DOM. The XMLSerializer should then replace Unicode characters with character references as required, depending on the character encoding chosen for the output document.
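
A minimal sketch of the difference, assuming the JDK's built-in TrAX Transformer as the serializer instead of the Xerces XMLSerializer from the question (the class name `LiteralVersusCharacter` and its method are hypothetical). The DOM stores whatever string it is given as literal text, so a pre-escaped value gets its ampersand escaped a second time on output, while a real character passes through without any `&amp;`.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the original answer.
class LiteralVersusCharacter {

    // Builds <root test1="..."/> with the given value and serializes it via TrAX.
    static String serializeWithAttribute(String value) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            root.setAttribute("test1", value);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sw = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(sw));
            return sw.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Six literal characters '&', '#', '2', '3', '4', ';': the '&' is escaped on output.
        System.out.println(serializeWithAttribute("&#234;"));
        // One real character (U+00EA): nothing to double-escape.
        System.out.println(serializeWithAttribute("\u00EA"));
    }
}
```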

EDIT - The reason that you may still be seeing raw characters rather than character references in the output XML is that the XMLSerializer is using the default encoding for XML, i.e. UTF-8. The way to address this is to use the XMLSerializer(OutputFormat) constructor, passing an OutputFormat that specifies the required character encoding for the XML. (It sounds like you are using "ASCII".) Be sure to use a compatible character encoding for the OutputStream.
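
The effect of the output encoding can be sketched with the JDK's TrAX Transformer, used here as a stand-in for XMLSerializer(OutputFormat), which needs Xerces on the classpath. The class name `EncodingSketch` and its method are hypothetical. The idea is the same: when the target encoding cannot represent a character, the serializer should fall back to a numeric character reference; with UTF-8 it can write the character directly.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the original answer.
class EncodingSketch {

    // Serializes <root test1="..."/> to bytes in the given encoding,
    // then decodes those bytes with the same encoding for inspection.
    static String serialize(String attributeValue, String encoding) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            root.setAttribute("test1", attributeValue);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(bytes));
            return new String(bytes.toByteArray(), encoding);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // ASCII cannot represent U+00EA, so a character reference is expected.
        System.out.println(serialize("\u00EA", "US-ASCII"));
        // UTF-8 can represent it, so the raw character is expected.
        System.out.println(serialize("\u00EA", "UTF-8"));
    }
}
```

Writing to a byte stream rather than a Writer keeps the declared encoding and the actual bytes in agreement, which is the "compatible encoding" point made above.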

Stephen C
+1 sounds very reasonable. However, I tried it and the '\u00EA' remains unprocessed. I am putting the attribute value in the href attribute of an anchor tag e.g. <a href="\u00. It seems that IE cannot cope with '\u00EA' and therefore cannot find the document.
paul
The \u00EA is a Java Unicode escape. If it somehow appears in the output in that form, you must be including it in the input data rather than in a Java character or string literal.
Stephen C