views:

2004

answers:

3

I am trying to use a Rake task to migrate some legacy data from MS Access to MySQL. I'm working on Windows XP, using Ruby 1.8.6.

I have the encoding for Rails set as "utf8" in database.yml.

Also, the default character set for MySQL is utf8.

99% of the data is coming in fine, but every now and then I'll get a column value that gives me a error something like this:

Mysql::Error: Incorrect string value: '\x92 Comm...' for column 'name' 
  at row 1: 
  INSERT INTO `organizations` ( [...] ) 
  VALUES('Lawyers’ Committee', [...] )

It looks as though the thing that's giving MySQL trouble is the apostrophe immediately after the "s" in the word "Lawyers".

Here's another one...

Mysql::Error: Incorrect string value: '\x99 aoc' for column 'department' 
  at row 1: 
  INSERT INTO `addresses` 
[...]
  'TRInfo™ aoc'
[....]

Looks like it's choking on the "TM" after "TRInfo".

Is there any Ruby or Rails method that I can run the data through to cleanse from it any characters that MySQL will choke on?

Ideally, it would be great to replace them with more palatable characters -- replace the apostrophe with a single quote and the TM symbol with the string "(TM)".

Or, if I could somehow configure MySQL to store those characters as-is without errors that would be great too.

A: 

It looks like your old database is in one string format (utf8?) and your rails is expecting something else. If you input is in utf8, have you tried configuring your rails to support it?

Todd Gardner
Thanks, everything is set to utf8.
Ethan
+3  A: 

It looks like your input data is not in utf-8.

I did a little investigating and the styled quote used in Lawyer's is encoded as \x92 in the Windows-1252 encoding, but would be nonsense for utf-8 (when I decoded it and encoded it into utf8, I got \xe2\x80\x99).

Thus you will need to convert the input strings from windows-1252 to utf-8 (or to unicode).

Kathy Van Stone
A: 

I had the same problem when putting contents of UTF-16 encoded files - which usually store one character per 16bit block - into mysql tables with java. The problem was that the UTF-16 encoded string contained so called surrogate pairs. It means two consecutive 16bit UTF-16 blocks encode one special character but cannot be translated into a corresponding UTF-8 encoding individually. See wikipedia for further explanation.

The solution was to simply replace these characters with spaces. This is the character range you might want to strip out of your string: U+D800–U+DFFF

lex82