ansaurus

Question

'' Not a valid unicode character, but in the unicode character set?

Answer 1

A:

what if you double encode it and store ?

get it encoded once again and stored. and later upon retrieval decode it once and render in html.

Wind Chimez 2010-06-17 09:27:59

Answer 2

A:

I can't answer the question of it being listed as both supported and unsupported, that's probably a question for the people running the fileformat.info site.

UTF-8 can be used to represent any Unicode character (code point). This is true of all of the UTFs. The number of bytes required to do so varies (in UTF-8, you need four for the code point you identified, for instance, whereas you only need one for the Roman letter 'A'), but all Unicode characters can be represented by all UTFs. That's what they're for. (More here.)

It sounds as though you're running into an encoding problem at one (or more) of the layers in your app. The first place to look would be the page served by your app: Does it say what charset it's using? It may be worth checking the headers being returned for your pages to see if they have:

Content-Type: text/html; charset="UTF-8"

...in them. If they don't, look for the equivalent meta tag in the HTML itself, though I seem to recall reading that meta isn't a good way to do this. Absent the headers being specific, the default applied will probably be ISO-8859-1 (though some browsers may use Windows-1252 instead), which won't work if your source text is encoded with UTF-8.

The next place to look is your database. I don't think MySQL stores text in UTF-8 by default, you'll need to ensure that it's doing that in your MySQL configuration.

From your question, I don't think you need it, but I'll finish with the obligatory plug for the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (if only to save someone from plugging it in a comment). :-)

T.J. Crowder 2010-06-17 09:42:48

In fact the meta info was not on the page, but adding it made no difference. I'm going to follow up on Alan Moore's suggestion of updating MySQl to 5.5 and see if that does it.

Steve Cotner 2010-06-17 10:16:35

@Steve: Well, you don't need the `meta` if the header is there (and the header is the better way to go, if you can configure your server to send it). Of course, you need to make sure what you're actually sending really is UTF-8, as well. Good luck with it.

T.J. Crowder 2010-06-17 10:45:31

Answer 3

+1 A:

U+20089 is a defined character in the Unicode set (CJK Unified Ideographs Extension B) and becomes the byte sequence F0 A0 82 89 when encoded as UTF-8. The problem is probably not with the character, but character handling by the software somewhere in your stack.

In the unlikely event that there is an inherent technical reason for this being a problem character, it is likely to be covered in the Unicode standard or in the FAQs.

McDowell 2010-06-17 09:44:35

Answer 4

+6 A:

Which version of MySQL are you using? If it's before 5.5, you can't store that character because it would take four bytes and MySQL only supports up to three bytes UTF-8 (i.e., characters in the BMP). MySQL 5.5 added support for four-byte UTF-8, but you have to specify utf8mb4 as the Character Set.

ref: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html

Alan Moore 2010-06-17 09:49:04

I was using 5.0.67. I'm now downloading mysql-5.5.3-m3-osx10.5-x86 and I'll post the results once I've updated. I don't know how to convert an already existing database to utf8mb4, so I'll just make a new rails app with utf8mb4 specified as the encoding and see if i can add the character. I assume I can just change the database.yml file to read "encoding: utf8mb4"

Steve Cotner 2010-06-17 10:23:53

I updated mysql, reinstalled the rails gem, and made a new database with "encoding: utf8mb4" in the database.yml file.There is a small but perhaps significant change:Now, instead of the database not storing anything, there is a question mark where the '' character should be.Also, the database used to lose everything I'd typed after the offending character. Now, everything that comes afterward is saved without issue, and so the only problem is that the character shows up as ? instead of .That's progress...

Steve Cotner 2010-06-17 11:47:40

Ah, but Rails didn't actually make the table utf8mb4! Not sure how I'll make that happen automatically in the future, but this is what I did for now... Using a MySQL GUI, I modified the table and each relevant column to be utf8mb4, using the SQL queries provided at http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html. And voilà. Long live . Thanks!

Steve Cotner 2010-06-17 12:36:14

ansaurus

tags:

views:

answers:

'' Not a valid unicode character, but in the unicode character set?

related questions