views: 736
answers: 3
Man, this character encoding hole just keeps on getting deeper. Sigh. Ok. Check this out: I have a java String that contains the unicode character U+9996 (that's what I get if I do codePointAt()). If I look at it in the debugger expressions panel (in eclipse) then all is well and it looks like "首". However if I print it out to the console I get simply "?". It doesn't seem to be the font that's the problem as I've tried setting that differently.
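For reference, here's a minimal sketch of what seems to be going on (class and variable names are arbitrary): the String itself holds U+9996 just fine; it's System.out's charset encoder that silently substitutes '?' when the platform default charset can't represent the character, and forcing a UTF-8 PrintStream sends the right bytes regardless.

    import java.io.PrintStream;
    import java.io.UnsupportedEncodingException;

    public class EncodingDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String s = "\u9996";   // 首, U+9996

            // The in-memory String is intact no matter how it prints:
            System.out.println("code point: U+"
                    + Integer.toHexString(s.codePointAt(0)).toUpperCase());   // U+9996

            // System.out encodes with the platform default charset (file.encoding).
            // If that charset (e.g. MacRoman) can't represent U+9996, PrintStream
            // quietly replaces it with '?', which is what shows up on the console.
            System.out.println("default charset: " + s);

            // A UTF-8 PrintStream writes the correct bytes; whether they render
            // then depends only on the console's own encoding and font.
            PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
            utf8Out.println("utf-8 stream:    " + s);
        }
    }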

My real problem is that I'm trying to put the string into a MySQL database (with utf8 encoding). Lots of other wide characters show up fine in the db but, again, this one and some others like it show up as "?". All of which leads me to believe that the problem is on the java side.

In chasing down this bug I've learnt a little about Unicode Normalization and java.text.Normalizer, which looks like it might be relevant in this case. I've learnt that U+9996 is the canonical version of U+2FB8. U+2FB8 has exactly the same display problems as above, though, and anyway, why would I want to transform to a non-canonical representation (even if I could, which I don't think I can)?
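For what it's worth, normalization only goes in one direction here: it maps the Kangxi radical U+2FB8 to the unified ideograph U+9996, never the other way around. A minimal sketch with java.text.Normalizer (Java 6; NFKC applies both canonical and compatibility mappings, so it covers either case):

    import java.text.Normalizer;

    public class NormalizeDemo {
        public static void main(String[] args) {
            String radical = "\u2FB8";   // KANGXI RADICAL HEAD

            // The radical decomposes to the unified ideograph U+9996 首.
            String normalized = Normalizer.normalize(radical, Normalizer.Form.NFKC);

            System.out.println("U+"
                    + Integer.toHexString(normalized.codePointAt(0)).toUpperCase());   // U+9996
        }
    }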

Anyway, there's one potential clue I've found which I've been unable to comprehend. This page contains the words "U+9996 is not a valid unicode character" with no further explanation. It then proceeds to show how to encode this supposedly non-valid unicode character in various unicode encodings. So my question is this basically: WTF?


UPDATES

  • I'm on a Mac.
  • I'm talking about the Eclipse console.
    • I set the console encoding to UTF-8 under Run > Common
    • I added -Dfile.encoding=UTF-8 to the JVM arguments (the default was MacRoman)
    • The console (Eclipse and Terminal.app) now show the right chars. Hooray!
  • I'm mostly interested in the data getting into the database correctly though of course I'd like to get a total understanding of what's going on here.
  • I think I've fixed the database problem: I had forgotten to set the encoding on the connection (see the connection sketch after this list). Now I don't understand why some Asian characters were getting through and not others.
  • Phew, stackoverflow moves fast. It's hard to keep up. Thanks people.
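For reference, "setting the encoding on the connection" with MySQL Connector/J usually looks something like the sketch below. The URL, credentials, and the words/text table are placeholders, and the table (or the whole database) also needs to be created with CHARACTER SET utf8.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class Utf8InsertDemo {
        public static void main(String[] args) throws Exception {
            // useUnicode/characterEncoding tell Connector/J to send UTF-8 on the
            // wire; without them the driver can fall back to the platform default
            // and translate unrepresentable characters to '?'.
            String url = "jdbc:mysql://localhost:3306/test"
                       + "?useUnicode=true&characterEncoding=UTF-8";

            Connection conn = DriverManager.getConnection(url, "user", "password");
            try {
                // 'words' and its 'text' column are hypothetical.
                PreparedStatement ps =
                        conn.prepareStatement("INSERT INTO words (text) VALUES (?)");
                ps.setString(1, "\u9996");
                ps.executeUpdate();
                ps.close();
            } finally {
                conn.close();
            }
        }
    }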
A: 

I don't know about the problems, but it's definitely a valid Unicode character (and has been since Unicode 1.1).

Joachim Sauer
A: 
  1. What O/S is this running on?
  2. What console application is it (e.g. xterm, cmd.exe, etc.)?
  3. Is the console application set for UTF-8 output?

Regarding point 3 above, which is probably the important one: I've seen similar issues using e.g. PuTTY to talk to a Linux box, where the Linux box thought I was on UTF-8, but the PuTTY session itself was set to ISO-Latin-1 (8859-1).
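A quick way to check the Java side of this (which determines what System.out writes before the console ever sees it) is to ask the JVM what it thinks the default charset is; a minimal sketch:

    import java.nio.charset.Charset;

    public class CharsetCheck {
        public static void main(String[] args) {
            // What the JVM will use for System.out unless told otherwise.
            System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
            System.out.println("default charset = " + Charset.defaultCharset());
        }
    }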

Alnitak
In Eclipse you can set the encoding for the console; check out the preferences.
Yoni
+1  A: 

Have you verified that the value that gets stored in the database is actually U+003f (question mark)? There are all sorts of conventions for how to display characters that don't exist in the chosen font, and displaying them as '?' is fairly common.

So most likely, the character gets stored correctly, and for whatever reasons, simply gets displayed as '?'. Basically, ignore how it gets rendered, and look at what codepoint gets stored in the database. Is it U+9996 or U+003f (or something else entirely)? Don't blindly assume that just because it gets rendered as a question mark, it is actually a question mark that is stored in the database.

jalf
How do I verify the value in the database is correct? I don't see a SQL function to show codepoints.
Rowan
Read it back out with a java function and verify it at that point.
Darryl Braaten
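Along the lines Darryl suggests, here's a minimal sketch that reads the column back and dumps the code points (same hypothetical words/text table as above; connection details are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class VerifyStoredValue {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=UTF-8",
                    "user", "password");
            try {
                Statement st = conn.createStatement();
                ResultSet rs = st.executeQuery("SELECT text FROM words");
                while (rs.next()) {
                    String value = rs.getString(1);
                    // Walk the string code point by code point.
                    for (int i = 0; i < value.length(); i = value.offsetByCodePoints(i, 1)) {
                        // U+9996 means the right character was stored;
                        // U+3F means a literal question mark made it into the table.
                        System.out.print("U+"
                                + Integer.toHexString(value.codePointAt(i)).toUpperCase() + " ");
                    }
                    System.out.println();
                }
                rs.close();
                st.close();
            } finally {
                conn.close();
            }
        }
    }

Alternatively, on the MySQL side, SELECT HEX(text) FROM words shows the raw stored bytes: U+9996 in utf8 is E9A696, whereas a literal '?' shows up as 3F.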