Edit: I hardcoded the character and used the response writer to write it, but it still comes out as K�nigsberger:

response.setCharacterEncoding("UTF-8");

      response.setContentType(contentType);
      //if(contentType!=null)response.setHeader("Content-Type",contentType);
      Writer writer = response.getWriter();//new OutputStreamWriter(response.getOutputStream(),"UTF-8");
      System.err.println("character encoding is "+response.getCharacterEncoding());


      writer.write("Königsberger ");
      writer.flush();

Edit: I tried calling setContentType and setCharacterEncoding before getWriter(), but the output is no different:

     if(res.length()>0){
      //pw.write(res);
      response.setCharacterEncoding("UTF-8");
      response.setContentType(contentType);
      //if(contentType!=null)response.setHeader("Content-Type",contentType);
      Writer writer = response.getWriter();//new OutputStreamWriter(response.getOutputStream(),"UTF-8");
      System.err.println("character encoding is "+response.getCharacterEncoding());


      writer.write(res);
      writer.flush();
     }

I am reading some German characters and then outputting them as XML from a Java servlet. Here's how I read them as UTF-8:

int len=0;
     byte[]buffer=new byte[1024];
     OutputStream os = sock.getOutputStream();
     InputStream is = sock.getInputStream();
     query += "\r\n";
     os.write(query.getBytes("UTF8"));//iso8859_1"));

      do{
       len = is.read(buffer);
             if (len>0) { 
                 if(outstring==null)outstring=new StringBuffer();
                 outstring.append(new String(buffer,0,len, "UTF8"));
             }
           }while(len>0);
System.out.println(outstring);

System.out outputs the string correctly: Königsberger

However, when I pipe this string out through my servlet response, also using charset=UTF-8, it becomes garbled: K�nigsberger

private void outputResponse(String res, HttpServletRequest request,
      HttpServletResponse response) throws IOException {
     String outputFormat = getOutputFormat(request);
     String contentType=null;
     PrintWriter pw = response.getWriter();
     //response.setCharacterEncoding("UTF-8");
     System.err.println("output "+res);

     contentType= "text/xml; charset=UTF-8";
     res="<?xml version=\"1.0\" encoding=\"utf-8\"?>" + res;

     if(contentType!=null)response.setHeader("Content-Type",contentType);
     if(res.length()>0){
      pw.write(res);
     }
     pw.flush();

    }
+1  A: 
do{
  len = is.read(buffer);
  if (len>0) { 
    if(outstring==null) outstring=new StringBuffer();
    outstring.append(new String(buffer,0,len, "UTF8"));
  }
}while(len>0);

This is not a reliable way to decode UTF-8, because characters can be corrupted at buffer boundaries. UTF-8 is a variable-width encoding: each character takes between one and four bytes, so a multi-byte character can be split across two reads and each half decoded on its own. If it is working, you are just getting lucky. It is better to encode and decode using the Reader/Writer classes.
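To illustrate the buffer-boundary problem, here is a minimal sketch, not the asker's actual socket code. The class and method names (`Utf8StreamDecoding`, `decodePerChunk`, `decodeWithReader`) are made up for the example, and a deliberately tiny two-byte buffer is assumed so that the bytes c3 b6 of ö land in different reads:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
public class Utf8StreamDecoding {

  // Decodes every chunk on its own, like the loop in the question.
  // A multi-byte character split across two chunks cannot be decoded.
  public static String decodePerChunk(InputStream in, int bufferSize) throws IOException {
    StringBuilder out = new StringBuilder();
    byte[] buffer = new byte[bufferSize];
    int len;
    while ((len = in.read(buffer)) > 0) {
      out.append(new String(buffer, 0, len, StandardCharsets.UTF_8));
    }
    return out.toString();
  }

  // Lets InputStreamReader do the decoding; it keeps incomplete byte
  // sequences until the remaining bytes arrive.
  public static String decodeWithReader(InputStream in, int bufferSize) throws IOException {
    StringBuilder out = new StringBuilder();
    Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
    char[] buffer = new char[bufferSize];
    int len;
    while ((len = reader.read(buffer)) != -1) {
      out.append(buffer, 0, len);
    }
    return out.toString();
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "K\u00F6nigsberger".getBytes(StandardCharsets.UTF_8);
    // With a 2-byte buffer the first read is "4b c3" and the second "b6 6e".
    String chunked = decodePerChunk(new ByteArrayInputStream(data), 2);
    String viaReader = decodeWithReader(new ByteArrayInputStream(data), 2);
    System.out.println("per chunk intact:  " + chunked.equals("K\u00F6nigsberger"));
    System.out.println("via reader intact: " + viaReader.equals("K\u00F6nigsberger"));
  }
}
```

With a 1024-byte buffer the split only happens when a multi-byte character happens to straddle a read boundary, which is why the per-chunk loop can appear to work for a long time.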

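I believe you need to call either setContentType or setCharacterEncoding before calling getWriter, because the writer's encoding is fixed once getWriter has been called, and your outputResponse method gets the writer first. I don't think it is enough to call setHeader directly.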


This servlet code will correctly encode and transmit the sample string as UTF-8 data:

  @Override
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    response.setContentType("text/xml; charset=UTF-8");
    PrintWriter pw = response.getWriter();
    pw.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
    pw.write("<data>K\u00F6nigsberger</data>");
    pw.flush();
    pw.close();
  }

Note that I am using the escape sequence \u00F6 to emit the character U+00F6 (ö), so that the character cannot be corrupted by my text editor or during compilation.

Is it possible that the data is being misinterpreted on the client? Check the output with a hex editor.

Encoded as UTF-8, "K\u00F6nigsberger" should become the byte sequence:

4b c3 b6 6e 69 67 73 62 65 72 67 65 72

...where the character U+00F6 (ö) becomes c3 b6. You can use code like this to check your values:

  public static void main(String[] args) throws IOException {
    String konigsberger = "K\u00F6nigsberger";
    dumpHex(System.out, konigsberger.getBytes("UTF-8"));
  }

  private static void dumpHex(PrintStream out, byte[] data) {
    for (byte b : data) {
      out.format("%02x ", b);
    }
    out.println();
  }
McDowell
I suspect it's an XML formatting issue rather than a Unicode encoding one. I used the code pw.write("<data>K\u00F6nigsberger</data>"); and when I view it in the browser the character still gets corrupted...
A: 

You can always use entities like this:

<test>
&#228;
&#252;
&#229;
</test>

to get:

<test>
ä
ü
å
</test>

Maybe not exactly what you want, but a nice workaround. You can use sites like utf8-chartable.de to look up the needed value.
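If the values need to be generated rather than looked up, a hand-rolled conversion is short. The following is only a sketch: the class `XmlNumericEscaper` and the method `escapeNonAscii` are hypothetical names, not part of any library. It assumes that only characters above U+007F should be replaced, and it does not escape `&`, `<` or `>`.

```java
// Hypothetical helper; the names are illustrative, not a library API.
public class XmlNumericEscaper {

  // Replaces every code point above U+007F with a decimal numeric
  // character reference and leaves ASCII characters untouched.
  public static String escapeNonAscii(String text) {
    StringBuilder out = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); ) {
      int codePoint = text.codePointAt(i);
      if (codePoint > 0x7F) {
        out.append("&#").append(codePoint).append(';');
      } else {
        out.append((char) codePoint);
      }
      // Advance by two chars for characters stored as surrogate pairs.
      i += Character.charCount(codePoint);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(escapeNonAscii("K\u00F6nigsberger")); // prints K&#246;nigsberger
  }
}
```

Because the result is pure ASCII, it survives even if the response is sent or read with the wrong charset, which is what makes this a workaround rather than a fix for the encoding itself.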

Tim Büthe
This outputs the German characters fine. Is there a way to convert the German characters into these XML codes?
I mean, is there a Java API to do the conversion directly in Java?
@unknown (google): There is no Java API that will create these entities automagically. See this answer for an example of how to do it: http://stackoverflow.com/questions/1273986/converting-utf-8-to-iso-8859-1-in-java/1274830#1274830
McDowell
I'm not sure if there is a library. What about Commons Lang's StringEscapeUtils? It has an escapeXml method that looks promising: http://commons.apache.org/lang/api/org/apache/commons/lang/StringEscapeUtils.html#escapeXml(java.lang.String)
Tim Büthe