I'm migrating some data from MS Access 2003 to MySQL 5.0 using Ruby 1.8.6 on Windows XP (writing a Rake task to do this).

It turns out the Windows string data is encoded as windows-1252, while Rails and MySQL both assume utf-8 input, so some of the characters, such as apostrophes, are getting mangled. They wind up as "a"s with an accent over them and stuff like that.

Does anyone know of a tool, library, system, methodology, ritual, spell, or incantation to convert a windows-1252 string to utf-8?

+2  A: 

If you're on Ruby 1.9...

string_in_windows_1252 = database.get(...)
# => "Fåbulous"

string_in_windows_1252.encoding
# => #<Encoding:Windows-1252>

string_in_utf_8 = string_in_windows_1252.encode('UTF-8')
# => "Fabulous"

string_in_utf_8.encoding
# => #<Encoding:UTF-8>
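
If the string comes back tagged as binary or as UTF-8 rather than as windows-1252, encode on its own won't do the right thing; the bytes need to be relabeled first. Here is a minimal sketch, assuming Ruby 1.9 or later. The helper name win1252_to_utf8 is made up for illustration, and replacing unmappable bytes with '?' is just one possible policy (you may prefer to let it raise during a migration).

```ruby
# Hypothetical helper: relabel raw bytes as Windows-1252, then transcode.
def win1252_to_utf8(raw)
  raw.dup
     .force_encoding('Windows-1252')   # only changes the label, not the bytes
     .encode('UTF-8',
             :invalid => :replace,     # substitute anything that can't be read
             :undef   => :replace,     # substitute bytes with no UTF-8 mapping
             :replace => '?')
end

raw = "It\x92s".dup.force_encoding('BINARY')  # 0x92 is the curly apostrophe in Windows-1252
fixed = win1252_to_utf8(raw)
# fixed is "It’s", tagged as UTF-8
```

force_encoding never touches the bytes, so it is safe to call on a copy; encode is the step that actually rewrites them.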
James A. Rosen
Thanks, that's good to know. However, I'm currently dealing with Ruby 1.8.6. I guess I could install 1.9, RubyGems, etc.
Ethan
I certainly hope that Ruby doesn't encode "Fåbulous" into "Fabulous", since 'å' is a very different character from 'a' in any language that uses it. If the UTF-8 encoded string were printed using the Windows-1252 code page, it ought to look something like "FÃ¥bulous", and printed as UTF-8 it should be "Fåbulous".
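
The effect described here can be reproduced in a few lines. This is only a sketch, assuming Ruby 1.9 or later; the variable names are made up for illustration.

```ruby
utf8 = "F\u00E5bulous"                     # "Fåbulous", stored as UTF-8

# A correct conversion keeps the å; it just changes how it is stored.
win1252 = utf8.encode('Windows-1252')      # å becomes the single byte 0xE5
round_trip = win1252.encode('UTF-8')       # back to "Fåbulous"

# Misreading the UTF-8 bytes as Windows-1252 is what produces the junk:
# the two bytes 0xC3 0xA5 are shown as two separate characters.
mojibake = utf8.dup.force_encoding('Windows-1252').encode('UTF-8')
# mojibake is "FÃ¥bulous"
```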
Andreas Magnusson
+1  A: 

If you want to convert a file named win1252_file on a Unix OS, run:

$ iconv -f windows-1252 -t utf-8 win1252_file > utf8_file

You should be able to do the same on Windows with Cygwin.

yhager
Thanks, great to know for the future, but I'm not sure how I would do that in this situation as I'm dealing with MS Access and MySQL in a Windows XP environment, not Cygwin.
Ethan
+2  A: 

If you're NOT on Ruby 1.9, and assuming yhager's command works, you could try

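require 'fileutils'

# Write the raw bytes out unchanged. Binary mode avoids newline translation,
# and writing the whole string avoids `file << byte`, which in 1.8 would
# write each byte's decimal number as text.
File.open('/tmp/w1252', 'wb') do |file|
  file.write(my_windows_1252_string)
end

# Shell out to iconv to do the actual conversion.
`iconv -f windows-1252 -t utf-8 /tmp/w1252 > /tmp/utf8`

my_utf_8_string = File.read('/tmp/utf8')

# Clean up the temporary files.
['/tmp/w1252', '/tmp/utf8'].each do |path|
  FileUtils.rm path
end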
James A. Rosen
Whoops: this fails for the same reason yhager's does: Windows has no iconv (or /tmp/ directory, for that matter). Hmm...
James A. Rosen
+5  A: 

For Ruby 1.8.6, it appears you can use Ruby Iconv, part of the standard library:

Iconv documentation

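According to this helpful article, it appears you can at least purge the byte sequences that aren't valid UTF-8 (which is what the stray win-1252 characters are) from your string like so. Note that this drops those characters rather than converting them:

require 'iconv'

# Appending a space and trimming it off again works around Iconv
# missing an invalid byte at the very end of the input.
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]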
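
For comparison, on newer Rubies where Iconv is no longer bundled, roughly the same purge can be sketched with String#scrub. This assumes Ruby 2.1 or later, so it is not an option on 1.8.6; the sample string is made up for illustration.

```ruby
# 0x92 on its own is not a valid UTF-8 sequence.
untrusted_string = "It\x92s fine".dup.force_encoding('UTF-8')

# scrub('') removes invalid byte sequences instead of raising.
valid_string = untrusted_string.scrub('')
# valid_string is "Its fine": the apostrophe is lost, not converted
```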

One might then attempt to do a full conversion like so:

ic = Iconv.new('UTF-8', 'WINDOWS-1252')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
austinfromboston
+1  A: 

Hi,

I had the exact same problem.

These tips helped me get going:

Always check for the proper encoding name so that you feed your conversion tools correctly. When in doubt, you can get a list of supported encodings for iconv or recode using:

$ recode -l

or

$ iconv -l

Always start from your original file and encode a sample to work with:

$ recode windows-1252..u8 < original.txt > sample_utf8.txt

or

$ iconv -f windows-1252 -t utf8 original.txt -o sample_utf8.txt

Install Ruby 1.9, because it helps you A LOT when it comes to encodings. Even if you don't use it in your program, you can always start an irb1.9 session and poke at the strings to see what the output is. File.open accepts encodings in its 'mode' parameter in Ruby 1.9. Use it! This article helped a lot: http://blog.nuclearsquid.com/writings/ruby-1-9-encodings

File.open('original.txt', 'r:windows-1252:utf-8')
# This opens a file with both encodings specified. r:windows-1252 means the bytes on disk are read as windows-1252. :utf-8 means they are transcoded to utf-8 as they are read, so the strings you get back are utf-8.
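
To see that mode string in action, here is a small sketch, assuming Ruby 1.9 or later. The helper name read_win1252_as_utf8 and the sample file are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: read a windows-1252 file and get UTF-8 strings back.
def read_win1252_as_utf8(path)
  File.open(path, 'r:windows-1252:utf-8') { |file| file.read }
end

# Write a sample file containing the single windows-1252 byte 0xE9 (é).
sample = Tempfile.new('win1252_sample')
sample.binmode
sample.write("caf\xE9".dup.force_encoding('BINARY'))
sample.close

text = read_win1252_as_utf8(sample.path)
sample.unlink
# text is "café", tagged as UTF-8
```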

Have fun and swear a lot!

Overbryd