I'm having trouble saving UTF-8 data from a form and having it stored correctly in MySQL. In particular, via my Ruby application I'm posting a form that includes the following:

Gerhard Tröster

Which in my terminal I see is being updated in the database as:

UPDATE `xxxx` SET 
   `updated_at` = '2009-08-13 14:22:33', 
   `description` = '<p><span style=\"font-size: 14px; line-height: normal; white-space: pre; \">Gerhard Tr?ster</span></p>' 
WHERE `id` = 1228

However when I select from this table it says:

| description |
---------------
| Gerhard Tr | 

Note that it's simply truncating everything AFTER the umlaut, even though the insert appears to have included it (or something like it).

My database.yml has encoding set to UTF8, and I've included the appropriate META tags in my HTML as well.

+1  A: 

The question mark in the db entry means it hasn't been updated correctly as utf8. You need to make sure that the db tables and columns use a utf8 character set and collation, and that you set the connection to utf8 too. To ensure that, you can run the MySQL query SET NAMES 'utf8' (MySQL's name for the character set is utf8, not UTF-8).

(Furthermore I'm wondering why you're storing all this markup in your db?)

tharkun
Thank you. The ? I assumed was my terminal because it's showing what was *sent* to the DB. The DB simply does not include anything after and including the question mark.
PETER BROWN
+1 for markup in DB - the only reason I could see that being necessary is when you're storing user input as rich text
John Rasch
+1  A: 

There are (amazingly) four places you need to set the UTF-8 encoding in order to ensure your data gets saved in that format in MySQL (why they don't use UTF-8 as the default is beyond me): the connection, the database, the table and the columns. Specifying utf8 in your database.yml takes care of the connection; the other three have to be set in MySQL (using the character set, collate and set names commands).
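As a rough sketch of what those MySQL statements could look like, the helper below only builds the SQL as strings so you can review it before running anything. The helper name, the database name myapp_development and the table name articles are made up for illustration, and utf8_general_ci is just one common collation choice, not something this answer prescribes.

```ruby
# Builds the MySQL statements for the database, table/column and connection levels.
# The database and table names passed in are hypothetical placeholders.
def utf8_setup_statements(database, tables, charset: "utf8", collation: "utf8_general_ci")
  statements = []
  # Database default: applies to tables created afterwards.
  statements << "ALTER DATABASE `#{database}` CHARACTER SET #{charset} COLLATE #{collation};"
  # Table and columns: CONVERT TO also converts the existing text columns.
  tables.each do |table|
    statements << "ALTER TABLE `#{table}` CONVERT TO CHARACTER SET #{charset} COLLATE #{collation};"
  end
  # Connection: the part that the encoding setting in database.yml is meant to cover.
  statements << "SET NAMES '#{charset}';"
  statements
end

puts utf8_setup_statements("myapp_development", ["articles"])
```

Back up first: CONVERT TO rewrites the stored data, so rows that were already saved with the wrong encoding may need repairing separately. The charset is a parameter so that utf8mb4 can be passed instead on servers that support it.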

Just for good measure, you might also need to add a UTF-8 directive to your HTML head and to your environment, to make sure that it "takes" across the board.

Some helpful info here: http://word.wardosworld.com/?p=164

insane.dreamer
A: 

These issues are symptomatic of a few possible problems. Mostly nothing to do with Ruby.

1) Your form gets sent with an Accept-Charset different from UTF-8. This will happen if

  • the page the form gets sent from is itself not UTF-8, by meta tag or HTTP header (a form from a Latin-1 page will be Latin-1)
  • The form explicitly specifies that it is sent as something other than UTF-8
  • You are using Javascript to post the data and not escaping correctly, or your users do

In this case the browser might be downgrading Unicode to the charset it can send. In general, the assumed accept-charset of the form is the charset of the page that displays the form in the first place.

2) Your MySQL server is configured in a manner that actively obstructs you from using UTF-8 for data storage, so MySQL silently downgrades your UTF-8 to something else (say the server admin forces MySQL to do SET NAMES SOME_CRAPPY_8BIT_CHARSET_OF_1990 on every connection. No joke - this happened to me once). Read this article, which explains how to hardwire everything for UTF-8 with 100% certainty: http://www.fngtps.com/2007/02/ruby-and-mysql-encoding-flakiness

3) The terminal you are looking at is not showing you UTF-8 and tries to recode it into Latin or ASCII, dropping characters it cannot display and replacing them with "?" (standard pattern). If you do "puts 'ü'" in plain Ruby with $KCODE set, what do you see? Windows terminals are especially susceptible to this kind of behavior before special settings are in place.

4) You are running Ruby 1.9, whose handling of Unicode is a special matter altogether.

5) Totally unlikely but who knows: you are using (or your hoster is using) some crappy proxy solution which mangles your charset headers or recodes the input being sent. I can bet on 2 and 3 with about 50% chance.
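Points 1 and 3 can be told apart by looking at the bytes. The sketch below is only an illustration: it assumes Ruby 1.9 or later (String#encode does not exist in the Ruby 1.8 that $KCODE belongs to), and all helper names are made up for this example.

```ruby
# "Tr\u00F6ster" is "Tröster"; the \u escapes keep this file ASCII-only.
NAME = "Tr\u00F6ster"

# Point 1: the bytes a browser would send if the form were submitted as Latin-1.
def latin1_bytes(text)
  text.encode("ISO-8859-1").bytes
end

# The leading part of raw bytes that is still valid when read as UTF-8.
def valid_utf8_prefix(bytes)
  raw = bytes.pack("C*").force_encoding("UTF-8")
  raw.each_char.take_while(&:valid_encoding?).join
end

# Point 3: what a non-UTF-8 terminal might show after recoding to ASCII.
def downgrade_to_ascii(text)
  text.encode("US-ASCII", undef: :replace)
end

# UTF-8 bytes written in the octal notation that Ruby 1.8's irb prints.
def utf8_octal(text)
  text.encode("UTF-8").bytes.map { |b| format("\\%o", b) }.join
end

p latin1_bytes(NAME)                     # 246 (0xF6) is "ö" in Latin-1
p valid_utf8_prefix(latin1_bytes(NAME))  # stops at the stray Latin-1 byte
puts downgrade_to_ascii(NAME)            # "ö" is replaced by "?"
puts utf8_octal("\u00FC")                # the two UTF-8 bytes of "ü"
```

If the form data arrived as Latin-1 and MySQL read it as utf8, cutting the value at the first invalid byte would match the "Gerhard Tr" in the question. That link is a guess, not something confirmed in this thread.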

Julik
Thank you. I tried this in the same terminal I've referenced above:
irb(main):001:0> "puts 'ü'"
=> "puts '\303\274'"
So it displays correctly on input and there you can see the output.
PETER BROWN
A: 

To make Ruby itself a little bit Unicode-aware, you need this line:

$KCODE = 'u'

I always put this line in config/environment.rb

And your database must be created with utf8 collation and you must have encoding set to UTF8 in database.yml.
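For the database.yml part, a quick sanity check might look like the sketch below. The YAML shown is a made-up sample (adapter and database names are placeholders, not taken from the question), and the helper name is hypothetical. A real database.yml that uses ERB or YAML aliases would need extra handling that this sketch leaves out.

```ruby
require "yaml"

# Hypothetical database.yml contents; adapter and database names are placeholders.
SAMPLE_DATABASE_YML = <<~YAML
  development:
    adapter: mysql
    database: myapp_development
    encoding: utf8
  production:
    adapter: mysql
    database: myapp_production
YAML

# Returns the names of the environments whose connection encoding is not utf8.
def environments_missing_utf8(yaml_text)
  config = YAML.safe_load(yaml_text)
  config.reject { |_env, settings| settings["encoding"].to_s.downcase == "utf8" }.keys
end

p environments_missing_utf8(SAMPLE_DATABASE_YML)
```

Here only the production entry would be flagged, because it has no encoding line.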

sloser
Rails has been using UTF-8/Unicode by default for a very long time. No need to set $KCODE.
molf
A: 

Although it was already mentioned above:

Putting encoding: utf8 in database.yml solved it for me.

Reppek