I'm working on a filter that should transform the output with a stylesheet. The important sections of the code look like this:

PrintWriter out = response.getWriter();
...
StringReader sr = new StringReader(content);
Source xmlSource = new StreamSource(sr, requestSystemId);
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setParameter("encoding", "UTF-8");
// Same result when using a ByteArrayOutputStream instead: ByteArrayOutputStream xo = new java.io.ByteArrayOutputStream();
StringWriter xo = new StringWriter();
StreamResult result = new StreamResult(xo);
transformer.transform(xmlSource, result);
out.write(xo.toString());

The problem is that national characters are written as HTML entities instead of being encoded as UTF-8. Is there any way to force the transformer to output UTF-8 instead of entities?

+1  A: 

You need to set the output method to text instead of the default, xml.

transformer.setOutputProperty(OutputKeys.METHOD, "text");

You should, however, also set the response encoding beforehand, that is, before calling response.getWriter():

response.setCharacterEncoding("UTF-8");

And instruct the web browser to use the same encoding:

response.setContentType("text/html;charset=UTF-8");
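To see the transformer side in isolation, here is a minimal sketch that runs without a servlet container. The class name `Utf8TransformSketch` and the method `transformToUtf8` are made up for illustration, and the stylesheet is passed in as a string only to keep the example self-contained. It writes the result to a byte stream, so the ENCODING output property actually determines the bytes produced. The servlet calls appear only in comments, since they need the servlet API.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of the original filter.
final class Utf8TransformSketch {

    // Applies the stylesheet to the XML and returns the result as UTF-8 bytes.
    static byte[] transformToUtf8(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));

            // Output method "text", as suggested above, plus an explicit encoding.
            transformer.setOutputProperty(OutputKeys.METHOD, "text");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

            // A byte stream (not a Writer) lets the transformer do the encoding itself.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(bytes));
            return bytes.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    // In the filter itself the order of the servlet calls matters:
    //   response.setCharacterEncoding("UTF-8");              // first
    //   response.setContentType("text/html;charset=UTF-8");
    //   PrintWriter out = response.getWriter();               // only afterwards
    // setCharacterEncoding() has no effect once getWriter() has been called.
}
```

Decoding the returned bytes with `new String(bytes, StandardCharsets.UTF_8)` should give back the national characters as they were, not as entities.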
BalusC
Both the "text" and "xml" methods produce unknown characters in place of the entities, displayed as � (question mark) in the browser. Those question marks are not interpreted correctly whichever page encoding I choose in the browser. Strange.
calavera.info
Then you need to set both the response encoding and the HTTP `Content-Type` header to the same character encoding, `UTF-8`. The first writes the characters in the desired encoding and the second tells the web browser which encoding to use. Also see [Unicode - How to get characters right?](http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html).
BalusC
As I wrote in the comment, those question marks are not interpreted correctly regardless of which encoding I choose. I tried UTF-8 first, of course, and some other national single-byte encodings to be sure. It's definitely not UTF-8. That's why I wrote that it's strange. It seems to me that the transformer is trying to avoid UTF-8 as much as possible, so it has chosen some obscure single-byte encoding to write those national characters.
calavera.info
I updated the answer with my comment translated into real code. Hopefully this makes it clearer what I mean.
BalusC
Yes, you are right. After changing the method, the problem was no longer in the transformer but in out.write(). When I tried it, I set the character encoding on the response AFTER I had called getWriter() on it, and that convinced me it was not the cause. Sorry, my mistake. Thank you very much.
calavera.info
You're welcome.
BalusC