If I use any ASCII characters from 33 to 127, the codePointAt method gives the correct decimal value, for example:

String s1 = new String("#");
int val = s1.codePointAt(0);

This returns 35 which is the correct value.

But if I try to use ASCII characters from 128 to 255 (extended ASCII/ISO-8859-1), this method gives the wrong value, for example:

String s1 = new String("ƒ");  // Latin small letter f with hook
int val = s1.codePointAt(0);

This should return 159 as per this reference table, but instead it returns 402. Why is this?

+4  A: 

But if I try to use ASCII characters from 128 to 255

ASCII doesn't have values in this range. It only uses 7 bits.

Java chars are UTF-16 (and nothing else!). If you want to represent ASCII using Java, you need to use a byte array.

The codePointAt method returns the Unicode code point as a 32-bit int. A single 16-bit char can't cover the entire Unicode range, so code points above U+FFFF must be split across two chars (a surrogate pair). The codePointAt method helps resolve chars to code points.
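To make the char versus code point distinction concrete, here is a minimal sketch. The class name CodePointDemo and the G clef test character are my own picks for illustration, not something from the question.

```java
public class CodePointDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String hook = "\u0192";        // ƒ, U+0192: fits in a single char
        String clef = "\uD834\uDD1E";  // U+1D11E MUSICAL SYMBOL G CLEF: needs a surrogate pair

        System.out.println(hook.codePointAt(0));   // 402
        System.out.println(charCount(clef));       // 2
        System.out.println(codePointCount(clef));  // 1
        System.out.println((int) clef.charAt(0));  // 55348, the high surrogate only
        System.out.println(clef.codePointAt(0));   // 119070, the full code point
    }
}
```

For ƒ, charAt(0) and codePointAt(0) agree, because U+0192 is in the Basic Multilingual Plane. They only differ for characters beyond U+FFFF.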

I wrote a rough guide to encoding in Java here.

McDowell
+2  A: 

Java chars are not encoded in ISO-8859-1. They use UTF-16, which has the same values as 7-bit ASCII only for the range 0-127.

To get the correct value for ISO-8859-1, you have to convert your string into a byte[] with String.getBytes("ISO-8859-1") and look in the byte array.

Update

ISO-8859-1 is not the extended ASCII encoding in question; use String.getBytes("Cp437") to get the expected values.
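A rough sketch of that lookup follows. It assumes the JRE ships the Cp437 (IBM437) charset, which lives in the extended charset set and may be missing from stripped-down runtimes. The class and helper names (Cp437Demo, firstByte) are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp437Demo {
    // Encodes s in the named charset and returns the first byte as an
    // unsigned value (Java bytes are signed, so mask with 0xFF).
    static int firstByte(String s, String charsetName) {
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        return bytes[0] & 0xFF;
    }

    public static void main(String[] args) {
        String hook = "\u0192"; // ƒ, written as an escape to avoid source-encoding issues

        // ISO-8859-1 has no mapping for U+0192.
        System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode(hook)); // false

        // Code page 437 maps ƒ to 0x9F.
        System.out.println(firstByte(hook, "Cp437")); // 159
    }
}
```

Without the & 0xFF mask the byte would print as -97, since Java's byte type is signed.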

josefx
'ƒ' is not represented in ISO-8859-1, so the result of getBytes is undefined. On some implementations it will only return the bytes for '?'.
Michael Konietzka
@Michael Konietzka good to know, I didn't check the encoding.
josefx
A: 

in Unicode

ƒ 0x0192 LATIN SMALL LETTER F WITH HOOK
irreputable
A: 

String.codePointAt returns the Unicode code point at the specified index.

The Unicode code point of ƒ is 402 (U+0192), see

http://www.decodeunicode.org/de/u+0192/properties

So

System.out.println("ƒ".codePointAt(0));

printing 402 is correct.

If you are interested in the representation in other charsets, you can print the byte representation of the character in each charset via getBytes(String charsetName):

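    import java.io.UnsupportedEncodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;
    import java.util.Arrays;

    final String s = "ƒ";
    for (final String csName : Charset.availableCharsets().keySet()) {
        try {
            final Charset cs = Charset.forName(csName);
            final CharsetEncoder encoder = cs.newEncoder();
            // Only print charsets that can actually represent the character.
            if (encoder.canEncode(s)) {
                System.out.println(csName + ": " + Arrays.toString(s.getBytes(csName)));
            }
        } catch (final UnsupportedOperationException uoe) {
            // This charset does not support encoding; skip it.
        } catch (final UnsupportedEncodingException e) {
            // Not expected for names taken from availableCharsets(); skip it.
        }
    }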
Michael Konietzka