Man, this character encoding hole just keeps on getting deeper. Sigh. OK, check this out: I have a Java String that contains the Unicode character U+9996 (that's what I get from codePointAt()). If I look at it in the debugger's Expressions panel (in Eclipse), all is well and it looks like "首". However, if I print it to the console I just get "?". The font doesn't seem to be the problem, as I've tried changing it.
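To make the symptom concrete, here is a minimal sketch of what I think is going on when the string is printed: the character is fine inside the String, but it turns into "?" as soon as it is encoded with a charset that has no mapping for it. I'm using ISO-8859-1 as a stand-in for whatever the console's default was (that choice is an assumption, as are the class and method names).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encode a string with the given charset and render the bytes as hex.
    public static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u9996"; // 首

        // The code point inside the String is intact.
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 9996

        // UTF-8 can represent it: three bytes.
        System.out.println(hex(s, StandardCharsets.UTF_8)); // E9 A6 96

        // ISO-8859-1 (stand-in for a legacy default) cannot, so
        // getBytes substitutes the replacement byte 0x3F, which is '?'.
        System.out.println(hex(s, StandardCharsets.ISO_8859_1)); // 3F
    }
}
```

If that is right, the "?" is produced at the encoding step, not by the String itself and not by the font.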
My real problem is that I'm trying to put the string into a MySQL database (with utf8 encoding). Lots of other wide characters show up fine in the db but, again, this one and some others like it show up as "?". All of which leads me to believe that the problem is on the Java side.
In chasing down this bug I've learnt a little about Unicode normalization and java.text.Normalizer, which looks like it might be relevant here. I've learnt that U+2FB8 normalizes to U+9996 (as far as I can tell only under the compatibility forms, NFKC and NFKD), so U+9996 is the preferred form. U+2FB8 has exactly the same display problems as above, though, and anyway why would I want to transform to the non-preferred representation (even if I could, which I don't think I can)?
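Here is a small sketch of the normalization behaviour as I understand it. It assumes the U+2FB8 to U+9996 mapping is a compatibility mapping, so NFC should leave U+2FB8 alone and NFKC should fold it into U+9996. The class and helper names are made up for the example.

```java
import java.text.Normalizer;

public class NormalizeDemo {
    // Render each code point of the string as U+XXXX, space separated.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String radical = "\u2FB8"; // KANGXI RADICAL HEAD

        // Canonical composition does not touch the radical.
        String nfc = Normalizer.normalize(radical, Normalizer.Form.NFC);
        System.out.println(codePoints(nfc)); // U+2FB8

        // Compatibility composition maps it to the unified ideograph.
        String nfkc = Normalizer.normalize(radical, Normalizer.Form.NFKC);
        System.out.println(codePoints(nfkc)); // U+9996
    }
}
```

Either way, both characters are outside what a legacy single-byte charset can encode, so normalization on its own would not make the "?" go away.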
Anyway, there's one potential clue I've found which I've been unable to comprehend. This page contains the words "U+9996 is not a valid unicode character" with no further explanation. It then proceeds to show how to encode this supposedly invalid Unicode character in various Unicode encodings. So my question is basically this: WTF?
UPDATES
- I'm on a Mac.
- I'm talking about the Eclipse console.
- I set the console encoding to UTF-8 under Run > Common
- I added -Dfile.encoding=UTF-8 to the JVM arguments (the default was MacRoman).
- Both consoles (Eclipse and Terminal.app) now show the right chars. Hooray!
- I'm mostly interested in the data getting into the database correctly, though of course I'd like to get a total understanding of what's going on here.
- I think I've fixed the database problem. I forgot to set the encoding on the connection. Now I don't understand why some Asian characters were getting through and not others.
- Phew, Stack Overflow moves fast. It's hard to keep up. Thanks, people.
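For anyone following along, the two settings from the updates can be sketched roughly like this. The host, schema, class and method names are placeholders I made up; the useUnicode and characterEncoding URL parameters are the MySQL Connector/J connection properties I believe are relevant, but treat the exact URL as an assumption about my setup, not a recipe.

```java
import java.nio.charset.Charset;

public class ConnectionEncodingSketch {
    // Build a MySQL JDBC URL that asks the driver to talk UTF-8.
    // Host and schema are hypothetical; only the query parameters matter here.
    public static String jdbcUrl(String host, String schema) {
        return "jdbc:mysql://" + host + "/" + schema
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // The charset the JVM uses whenever none is given explicitly;
        // this is what -Dfile.encoding=UTF-8 was changing for me.
        System.out.println("default charset: " + Charset.defaultCharset());

        // A real program would hand this URL to
        // DriverManager.getConnection(url, user, password).
        System.out.println(jdbcUrl("localhost:3306", "mydb"));
    }
}
```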