ansaurus

Question

Answer 1

+1 A:

do{
  len = is.read(buffer);
  if (len>0) { 
    if(outstring==null) outstring=new StringBuffer();
    outstring.append(new String(buffer,0,len, "UTF8"));
  }
}while(len>0);

This is not a safe way to decode UTF-8, because a character can be split across a buffer boundary and become corrupted (details here). UTF-8 is a variable-width encoding: each character takes between one and four bytes, so a read can end in the middle of a character, and decoding that buffer on its own mangles it. If it is working, you are just getting lucky. It is better to encode and decode using the Reader/Writer classes (details here).
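To make the buffer-boundary problem concrete, here is a minimal sketch that decodes the same bytes both ways. The class and method names are made up for illustration, the two-byte buffer is deliberately tiny to force a split, and StandardCharsets assumes Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original question or answer.
public class ReaderDecodeSketch {

  // Decodes the stream through a Reader. The decoder keeps an incomplete
  // multi-byte sequence until the rest of its bytes have been read.
  public static String readAll(InputStream is) throws IOException {
    StringBuilder out = new StringBuilder();
    try (Reader reader = new InputStreamReader(is, StandardCharsets.UTF_8)) {
      char[] buffer = new char[2];
      int len;
      while ((len = reader.read(buffer)) != -1) {
        out.append(buffer, 0, len);
      }
    }
    return out.toString();
  }

  // Decodes each byte buffer on its own, as in the loop quoted above.
  // A character that straddles two reads is decoded as two broken halves.
  public static String readChunked(InputStream is) throws IOException {
    StringBuilder out = new StringBuilder();
    byte[] buffer = new byte[2];
    int len;
    while ((len = is.read(buffer)) > 0) {
      out.append(new String(buffer, 0, len, StandardCharsets.UTF_8));
    }
    return out.toString();
  }

  public static void main(String[] args) throws IOException {
    String original = "K\u00F6nigsberger";
    byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

    String viaReader = readAll(new ByteArrayInputStream(utf8));
    String viaChunks = readChunked(new ByteArrayInputStream(utf8));

    System.out.println("Reader round trip intact:  " + original.equals(viaReader));
    System.out.println("Chunked round trip intact: " + original.equals(viaChunks));
  }
}
```

With a two-byte buffer the bytes c3 and b6 land in different reads, so the chunked version cannot reassemble the ö, while the Reader version returns the original string.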

I believe you need to call either setContentType or setCharacterEncoding prior to calling getWriter. I don't think it is enough to call setHeader directly.

This servlet code will correctly encode and transmit the sample string as UTF-8 data:

  @Override
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    response.setContentType("text/xml; charset=UTF-8");
    PrintWriter pw = response.getWriter();
    pw.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
    pw.write("<data>K\u00F6nigsberger</data>");
    pw.flush();
    pw.close();
  }

Note that I am using the escape sequence \u00F6 to emit the character U+00F6 (ö) to ensure that I do not corrupt the character in my text editor or during the compilation process (see here for more details).

Is it possible that the data is being misinterpreted on the client? Check the output with a hex editor.

Encoded as UTF-8, "K\u00F6nigsberger" should become the byte sequence:

4b c3 b6 6e 69 67 73 62 65 72 67 65 72

...where the character U+00F6 (ö) becomes c3 b6. You can use code like this to check your values:

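  import java.io.IOException;
  import java.io.PrintStream;

  // The class name is arbitrary; any class will do.
  public class HexDump {
    public static void main(String[] args) throws IOException {
      String konigsberger = "K\u00F6nigsberger";
      dumpHex(System.out, konigsberger.getBytes("UTF-8"));
    }

    // Prints each byte as two lowercase hex digits followed by a space.
    private static void dumpHex(PrintStream out, byte[] data) {
      for (byte b : data) {
        out.format("%02x ", b);
      }
      out.println();
    }
  }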

McDowell 2009-11-01 19:07:26

I suspect it's an XML formatting issue rather than a Unicode encoding one. I used the code pw.write("<data>K\u00F6nigsberger</data>"); and when I view it in the browser the character still gets corrupted...

2009-11-06 00:31:51

Answer 2

A:

You can always use numeric character references like this:

<test>
&#228;
&#252;
&#229;
</test>

to get:

<test>
ä
ü
å
</test>

Maybe not exactly what you want, but a nice workaround. You can use sites like utf8-chartable.de to look up the needed value.

Tim Büthe 2009-11-02 14:54:33

This outputs the de characters fine. Is there a way to convert these de characters into these XML codes?

2009-11-06 00:44:03

I mean, is there a Java API to do the conversion directly in Java?

2009-11-06 00:44:43

@unknown (google): There is no Java API that will create these entities automagically. See this answer for an example of how to do it: http://stackoverflow.com/questions/1273986/converting-utf-8-to-iso-8859-1-in-java/1274830#1274830
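As a rough illustration of doing it by hand (this is a sketch, not the code from the linked answer), the loop below replaces every code point above ASCII with a decimal character reference. The class and method names are invented for the example, and it deliberately does not escape markup characters such as < or &, which would need separate handling.

```java
// Hypothetical helper; names are illustrative only.
public class NumericCharRefSketch {

  // Replaces each non-ASCII code point with a decimal reference like &#246;
  // and copies ASCII characters through unchanged.
  public static String toNumericRefs(String text) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      if (cp > 0x7F) {
        out.append("&#").append(cp).append(';');
      } else {
        out.appendCodePoint(cp);
      }
      // Step by one or two chars so surrogate pairs stay together.
      i += Character.charCount(cp);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(toNumericRefs("K\u00F6nigsberger")); // K&#246;nigsberger
  }
}
```

Because the result is plain ASCII, it survives even if the response encoding is mislabeled, which is why it works as a workaround.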

McDowell 2009-11-06 10:24:00

I'm not sure whether there is a library. What about Commons Lang StringEscapeUtils? It has an escapeXml method that looks promising: http://commons.apache.org/lang/api/org/apache/commons/lang/StringEscapeUtils.html#escapeXml(java.lang.String)

Tim Büthe 2009-11-09 10:32:13

nolatin characters in xml output