views:

107

answers:

4

Short story: I can't get an entity like '𠂉' to store in a MySQL database, either by using a text field in a Ruby on Rails app (with default UTF-8 encoding) or by inputting it directly with a MySQL GUI app.

As far as I can tell, all Chinese characters and radicals can be entered into the database without problem, but not these rarely typed 'character components.' The character mentioned above is unicode U+20089 and html entity &#131209; I can get it to display on the page by entering <html>&#131209;</html> and removing html escaping, but I would like to store it simply as the unicode character and keep the html escaping in place. There are many other Chinese 'components' (parts of full characters, generally consisting of 2 or 3 strokes) that cause the same problem.

According to this page, the character mentioned is in the UTF-8 charset: http://www.fileformat.info/info/unicode/char/20089/charset_support.htm

But on the neighboring '...20089/index.htm' page, there's an alert saying it's not a valid unicode character.

For reference, that entity can be found in Mac OS X by searching through the character palette (international menu, "Show Character Palette"), searching by radical, and looking under the '丿' radical.

Apologies if this is too open-ended... can a character like this be stored in a UTF-8-based database? How is this character both supported and unsupported, both present in the character set and not valid?

A: 

what if you double encode it and store ?

get it encoded once again and stored. and later upon retrieval decode it once and render in html.

Wind Chimez
A: 

I can't answer the question of it being listed as both supported and unsupported, that's probably a question for the people running the fileformat.info site.

UTF-8 can be used to represent any Unicode character (code point). This is true of all of the UTFs. The number of bytes required to do so varies (in UTF-8, you need four for the code point you identified, for instance, whereas you only need one for the Roman letter 'A'), but all Unicode characters can be represented by all UTFs. That's what they're for. (More here.)

It sounds as though you're running into an encoding problem at one (or more) of the layers in your app. The first place to look would be the page served by your app: Does it say what charset it's using? It may be worth checking the headers being returned for your pages to see if they have:

Content-Type: text/html; charset="UTF-8"

...in them. If they don't, look for the equivalent meta tag in the HTML itself, though I seem to recall reading that meta isn't a good way to do this. Absent the headers being specific, the default applied will probably be ISO-8859-1 (though some browsers may use Windows-1252 instead), which won't work if your source text is encoded with UTF-8.

The next place to look is your database. I don't think MySQL stores text in UTF-8 by default, you'll need to ensure that it's doing that in your MySQL configuration.

From your question, I don't think you need it, but I'll finish with the obligatory plug for the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (if only to save someone from plugging it in a comment). :-)

T.J. Crowder
In fact the meta info was not on the page, but adding it made no difference. I'm going to follow up on Alan Moore's suggestion of updating MySQl to 5.5 and see if that does it.
Steve Cotner
@Steve: Well, you don't need the `meta` if the header is there (and the header is the better way to go, if you can configure your server to send it). Of course, you need to make sure what you're actually sending really is UTF-8, as well. Good luck with it.
T.J. Crowder
+1  A: 

U+20089 is a defined character in the Unicode set (CJK Unified Ideographs Extension B) and becomes the byte sequence F0 A0 82 89 when encoded as UTF-8. The problem is probably not with the character, but character handling by the software somewhere in your stack.

In the unlikely event that there is an inherent technical reason for this being a problem character, it is likely to be covered in the Unicode standard or in the FAQs.

McDowell
+6  A: 

Which version of MySQL are you using? If it's before 5.5, you can't store that character because it would take four bytes and MySQL only supports up to three bytes UTF-8 (i.e., characters in the BMP). MySQL 5.5 added support for four-byte UTF-8, but you have to specify utf8mb4 as the Character Set.

ref: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html

Alan Moore
I was using 5.0.67. I'm now downloading mysql-5.5.3-m3-osx10.5-x86 and I'll post the results once I've updated. I don't know how to convert an already existing database to utf8mb4, so I'll just make a new rails app with utf8mb4 specified as the encoding and see if i can add the character. I assume I can just change the database.yml file to read "encoding: utf8mb4"
Steve Cotner
I updated mysql, reinstalled the rails gem, and made a new database with "encoding: utf8mb4" in the database.yml file.There is a small but perhaps significant change:Now, instead of the database not storing anything, there is a question mark where the '' character should be.Also, the database used to lose everything I'd typed after the offending character. Now, everything that comes afterward is saved without issue, and so the only problem is that the character shows up as ? instead of .That's progress...
Steve Cotner
Ah, but Rails didn't actually make the table utf8mb4! Not sure how I'll make that happen automatically in the future, but this is what I did for now... Using a MySQL GUI, I modified the table and each relevant column to be utf8mb4, using the SQL queries provided at http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html. And voilà. Long live . Thanks!
Steve Cotner