Some characters are not supported by certain charsets, so the test below fails. I would like to use HTML entities to encode ONLY the characters that are not supported. How can I do this in Java?

public void testWriter() throws IOException{
    String c = "\u00A9";
    String encoding = "gb2312";
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    Writer writer  = new BufferedWriter(new OutputStreamWriter(outStream, encoding));
    writer.write(c);
    writer.close();
    String result = new String(outStream.toByteArray(), encoding);
    assertEquals(c, result);
}
A: 

Try using StringEscapeUtils from Apache Commons.

Tom
StringEscapeUtils escapes everything non-ASCII (not only what cannot be encoded).
Thilo
+3  A: 

I'm not positive I understand the question, but something like this might help:

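import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

...

  StringBuilder buf = new StringBuilder(c.length());
  // Charset.forName() returns a Charset; the encoder comes from newEncoder()
  CharsetEncoder enc = Charset.forName("gb2312").newEncoder();
  for (int idx = 0; idx < c.length(); ++idx) {
    char ch = c.charAt(idx);
    if (enc.canEncode(ch)) {
      // representable in the target charset: keep the character as-is
      buf.append(ch);
    } else {
      // not representable: write a decimal numeric character reference
      buf.append("&#");
      buf.append((int) ch);
      buf.append(';');
    }
  }
  String result = buf.toString();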

This code is not robust, because it doesn't handle characters beyond the Basic Multilingual Plane. But by iterating over the code points in the String, and using the canEncode(CharSequence) method of the CharsetEncoder, you should be able to handle any character.
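
A minimal sketch of that code-point variant might look like the following. The class name EntityEscaper and the method escapeUnencodable are hypothetical names chosen for illustration. The sketch assumes decimal numeric character references are acceptable in the output, and it uses US-ASCII in the usage example only because that charset is available in every Java runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper class, for illustration only.
final class EntityEscaper {

    // Replaces only the code points that the named charset cannot encode
    // with decimal numeric character references such as &#169;.
    static String escapeUnencodable(String input, String charsetName) {
        CharsetEncoder enc = Charset.forName(charsetName).newEncoder();
        StringBuilder buf = new StringBuilder(input.length());
        int idx = 0;
        while (idx < input.length()) {
            int cp = input.codePointAt(idx);
            int len = Character.charCount(cp); // 1 for BMP, 2 for a surrogate pair
            // Test the whole code point so a surrogate pair is checked together.
            if (enc.canEncode(input.subSequence(idx, idx + len))) {
                buf.appendCodePoint(cp);
            } else {
                buf.append("&#").append(cp).append(';');
            }
            idx += len;
        }
        return buf.toString();
    }
}
```

For example, EntityEscaper.escapeUnencodable("a\u00A9b", "US-ASCII") returns "a&#169;b", while the same call with "UTF-8" returns the input unchanged. Passing "gb2312" instead should cover the case in the question, provided the runtime ships that charset.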

erickson
Thank you. I believe canEncode() from CharsetEncoder is what I am looking for.
Kevin Yu
A: 

Just use UTF-8; that way there is no reason to use entities. If the argument is that some clients need gb2312 because they don't understand Unicode, then entities are not much use either, because numeric entities represent Unicode code points.
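
A minimal sketch of that suggestion, assuming the consumer of the bytes can accept UTF-8: the round trip from the question, with the charset swapped. The class name Utf8RoundTrip and the method roundTrip are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8RoundTrip {

    // Writes the text as UTF-8 bytes and decodes those bytes again.
    static String roundTrip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With UTF-8, Utf8RoundTrip.roundTrip("\u00A9") returns the same string it was given, so no entity escaping is needed.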

Mihai Nita
A: 

Thanks for the answers above. I could not find a place to choose the right answer. I agree with sylvarking the most. Thanks.

Kevin Yu