views:

20

answers:

1

I' working on a Chinese/Japanese learning web app where many tables are indexed by the characters (the "glyphs") of those languages.

I'm wondering if the integer codepoint value of the glyph would be better for performance than using a single utf8 character (for primary key and indexes)?

Using a single utf8 character would be very useful because I can see the unicode characters fine in the shell I'm using, and this makes debugging the SQL querries of this app easier.

In theory MySQL would treat a single utf8 character as a unique integer value similarly to a mediumint (3 bytes)... but I suspect MySQL will handle the column as a string instead.

Would there be performance issues due to MySQL treating my single utf8 char as a string?

Would you recommend to stick to the integer codepoint for indexes and primary keys, and perhaps use CONVERT() or other operator to get the utf8 character in results?

+1  A: 

MySQL will store and index a UTF-8 character as a multi-byte string, yes. So I would expect integer to be a faster key, though the difference in performance is unlikely to be significant.

Another possible issue is that until MySQL 6.0, the utf8 character set doesn't support characters outside the Basic Multilingual Plane (ie it's limited to three bytes per character). If you want to use some of the really obscure kanji in the Supplementary Ideographic Plane, that'd be no good.

bobince