+1  Q: 

Reading Unicode

I'm using java.io to retrieve text from a server that might output characters such as é. When I then print the text using System.err, those characters come out as '?'. I am using UTF-8 encoding. What's wrong?

int len = 0;
char[] buffer = new char[1024];
OutputStream os = sock.getOutputStream();
InputStream is = sock.getInputStream();
os.write(query.getBytes("UTF8"));//iso8859_1"));

Reader reader = new InputStreamReader(is, Charset.forName("UTF-8"));
do {
    len = reader.read(buffer);
    if (len > 0) {
        if (outstring == null) {
            outstring = new StringBuffer();
        }
        outstring.append(buffer, 0, len);
    }
} while (len > 0);
System.err.println(outstring);

Edit: I just tried the following code:

StringBuffer b = new StringBuffer();
for (char c = 'a'; c < 'd'; c++) {
    b.append(c);
}
b.append('\u00a5'); // Japanese Yen symbol
b.append('\u01FC'); // Roman AE with acute accent
b.append('\u0391'); // GREEK Capital Alpha
b.append('\u03A9'); // GREEK Capital Omega

for (int i = 0; i < b.length(); i++) {
    System.out.println("Character #" + i + " is " + b.charAt(i));
}
System.out.println("Accumulated characters are " + b);

The output came out as junk as well:

Character #0 is a
Character #1 is b
Character #2 is c
Character #3 is ¥
Character #4 is ?
Character #5 is ?
Character #6 is ?
Accumulated characters are abc¥???
A: 

Write the output to a file and check how it comes out. If it appears correctly in the file, then the problem is with your error stream (its encoding is not UTF-8). If it also comes out as junk characters there, your server's encoding may not be UTF-8.

sreejith
The file came out the same, but another reference program reads and displays the Unicode characters just fine (I don't have the source code for that program).
I changed the encoding to UTF-8 in Eclipse and ran the newly added code, and it comes out properly. Please check it that way.
sreejith
+1  A: 

First, verify that the system property (file.encoding) is in fact UTF-8. If it is, then your problem isn't the code you're running but your terminal program (or other output display) being unable to properly render the output.
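
One way to run that check is sketched below. The class name EncodingCheck and the helper describe() are made up for illustration; the sketch only reads the file.encoding property and the JVM's default charset, and it says nothing about how the terminal renders the bytes it receives.

```java
import java.nio.charset.Charset;

public class EncodingCheck {
    // Builds a one-line summary of the encoding-related settings of this JVM.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Print the summary so it can be compared with the terminal's own setting.
        System.out.println(describe());
    }
}
```

If the reported charset is not UTF-8, starting the JVM with -Dfile.encoding=UTF-8 is a commonly used workaround, although its effect can vary between Java versions.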

caskey
A: 

Your second example produces the following output for me.

Character #0 is a
Character #1 is b
Character #2 is c
Character #3 is ¥
Character #4 is Ǽ
Character #5 is Α
Character #6 is Ω
Accumulated characters are abc¥ǼΑΩ

This code produces a correctly encoded UTF-8 file having the same content.

StringBuilder b = new StringBuilder();
for (char c = 'a'; c < 'd'; c++) {
    b.append(c);
}
b.append('\u00a5'); // Japanese Yen symbol
b.append('\u01FC'); // Roman AE with acute accent
b.append('\u0391'); // GREEK Capital Alpha
b.append('\u03A9'); // GREEK Capital Omega

PrintStream out = new PrintStream("temp.txt", "UTF-8");
for (int i = 0; i < b.length(); i++) {
    out.println("Character #" + i + " is " + b.charAt(i));
}
out.println("Accumulated characters are " + b);
out.close(); // flush and release the file
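
The same idea can be applied to the console instead of a file. This is a minimal sketch, assuming the terminal itself is configured to display UTF-8; the class name Utf8Console and its helpers utf8() and encode() are hypothetical and exist only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    // Wraps any stream in an auto-flushing PrintStream that encodes text as UTF-8.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Sends the text through a UTF-8 PrintStream and returns the bytes that were written.
    static byte[] encode(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = utf8(sink);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Write to the console through an explicit UTF-8 encoder
        // instead of relying on the platform default encoding.
        PrintStream out = utf8(System.out);
        out.println("Accumulated characters are abc\u00a5\u01FC\u0391\u03A9");
    }
}
```

If the characters still show up as '?' or as garbage with this wrapper in place, the bytes leaving the JVM are UTF-8 and the remaining suspect is the terminal or IDE console setting.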

See also: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

trashgod