views:

69

answers:

2

I've declared a field in my INNODB/MySQL table as

VARCHAR(255) CHARACTER SET utf8 NOT NULL

however when inserting my data is truncated at 255 bytes not characters. This might chop the trailing two bite code point i*emphasized text*n two leaving an invalid character. Any ideas what I might be doing wrong

EDIT:

A sample session is like this

mysql> update channel set comment="ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ ᛋᚳᛖᚪᛚ᛫ᚦᛖᚪᚻ᛫ᛗᚪᚾᚾᚪ᛫ᚷᛖᚻᚹᛦᛚᚳ᛫ᛗᛁᚳᛚᚢᚾ᛫ᚻᛦᛏ᛫ᛞᚫᛚᚪᚾᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩᚱ᛫ᛞᚱᛁᚻᛏᚾᛖ᛫ᛞᚩᛗᛖᛋ᛫ᚻᛚᛇᛏᚪᚾ᛬x" where id = 1;
Query OK, 0 rows affected, 1 warning (0.00 sec)
Rows matched: 1  Changed: 0  Warnings: 1

mysql> select id, channelName, comment from channel;
+----+-------------+------------------------------------------------------------------------------------------
| id | channelName | comment                                                                                                                                                                                                                                                         |
+----+-------------+-----------------------------------------------------------------------------------------
|  1 | foo         | ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ ᛋᚳᛖᚪᛚ᛫ᚦᛖᚪᚻ᛫ᛗᚪᚾᚾᚪ᛫ᚷᛖᚻᚹᛦᛚᚳ᛫ᛗᛁᚳᛚᚢᚾ᛫ᚻᛦᛏ᛫ᛞᚫᛚᚪᚾᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩ�� |
+----+-------------+-----------------------------------------------------------------------------------------
1 row in set (0.00 sec)

via mysql-admin I look at the comment field and see that it is indeed VARCHAR(255) and uses "UTF-8 Unicode"

from the command

show full columns from channel

I get

+-----------------------------+------------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| Field                       | Type             | Collation       | Null | Key | Default | Extra          | Privileges                      | Comment |
+-----------------------------+------------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| id                          | int(11)          | NULL            | NO   | PRI | NULL    | auto_increment | select,insert,update,references |         |
| channelName                 | varchar(255)     | utf8_general_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| comment                     | varchar(255)     | utf8_general_ci | NO   |     | NULL    |                | select,insert,update,references |         |
+-----------------------------+------------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+

mysql> SHOW VARIABLES LIKE 'character_set%'

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+5  A: 

According to the manual, you should be fine:

MySQL interprets length specifications in character column definitions in character units. (Before MySQL 4.1, column lengths were interpreted in bytes.) This applies to CHAR, VARCHAR, and the TEXT types.

Do you happen to be using a pre-4.1 version of mySQL?

Pekka
Oh, interesting. One of us is wrong. I’d be amazed to learn MySQL uses characters here. Looking into it.
Scytale
@Scytale your explanation indeed makes more sense, and I too would have assumed it uses bytes. The change to characters may have to do with the fact that `VARCHAR()` fields can exceed 255 bytes since 4.1
Pekka
Wow. Sorry I did not believe you. Indeed it’s characters for MySQL ≥ 4.1. I’ll delete my wrong answer. (For the record: The storage calculation to achieve character limits is as follows: M × w bytes, 0 <= M <= 255, where w is the number of bytes required for the maximum-length character in the character set.)
Scytale
@Scytale thanks for the confirmation and the info!
Pekka
strange because my UTF-8 strings seems to be being truncated at 255 bytes not characters. Using MySQL 5.1 on ubuntu, standard install from apt.
bradgonesurfing
@brad that's strange. You're sure they don't get cut off because of broken character input (like ISO-8859-1 characters getting fed into a UTF-8 field)?
Pekka
See the edit I made above of a sample UTF-8 update on the table.
bradgonesurfing
@Scytale: For UTF-8, *w* = 4, so does a `VARCHAR(255)` then consume 1020 bytes?
Philipp
I made another edit showing the output from show full columns from channel and it is ineed VARCHAR(255) UTF8
bradgonesurfing
I even put the full update SQL text in a file and ran "file -i /tmp/foo.sql" and it told me the file was utf-8
bradgonesurfing
@brad really strange. Does it behave differently if you change the type to `TEXT`?
Pekka
Hammerite found the problem was that my connection was latin1 encoded. But thanks to everybody here for pitching in.
bradgonesurfing
+2  A: 

This is a stab in the dark, but are you using UTF-8 as the connection and client character sets? Issue SHOW VARIABLES LIKE 'character_set%' and see whether it tells you UTF-8 or latin-1.

Perhaps if you are using the wrong connection/client character sets, the UTF-8 bytes are reinterpreted as single-byte characters and stored that way in the database.

Hammerite
I've updated the question with the output from what you asked. Does this help?
bradgonesurfing
Your connection character set and client character set are both `latin1`. Try the following: issue `SET NAMES 'utf8'` and then reattempt the insert statement that didn't work. If it works then the problem is the connection and client character sets.
Hammerite
set NAMES 'utf8' indeed solved the problem. What exactly did that do?
bradgonesurfing
It means that MySQL presumes that when you communicate with it, your strings are encoded in UTF-8 rather than Latin-1. That means that all of your UTF-8 encoded strings arrive safely, instead of being reinterpreted as Latin-1 strings. I think what was happening here is that your UTF-8 string was being interpreted as a series of characters - one character per byte. Hence 255 "characters" (bytes) corresponding to the first 255 bytes of the string were stored.
Hammerite
Super explanation. Can this go in the connection string rather than as an extra SQL command?
bradgonesurfing
CharSet=UTF8 I answer my own question.
bradgonesurfing
If you have `SUPER` privileges, you can change the default character sets for new connections by issuing `SET GLOBAL character_set_client = 'utf8'; SET GLOBAL character_set_results = 'utf8'; SET GLOBAL character_set_connection = 'utf8';`
Hammerite
+1 for great explanation.
Pekka