It seems to me that the YAML library that ships with Ruby 1.9 is encoding-deaf.

What this means is that when generating YAML, it'll take any string of bytes and escape any byte sequence that isn't clean ASCII. That's lame, but acceptable.

My problem is the other way around: loading content back from said YAML dump.

In the example that follows, I create a UTF-8 string and dump it; it gets dumped with the type !binary. When I load it back, it has the encoding ASCII-8BIT. At the end of the example, I try to concatenate both the original and the reloaded string with another UTF-8 string. The latter fails with an Encoding::CompatibilityError.

require 'yaml'
s0 = "Iñtërnâtiônàlizætiøn"
y  = s0.to_yaml
s1 = YAML::load y
puts s0                 # => Iñtërnâtiônàlizætiøn
puts s0.encoding        # => UTF-8
puts s1                 # => Iñtërnâtiônàlizætiøn
puts s1.encoding        # => ASCII-8BIT
puts y                  # => --- !binary |
                        #    ScOxdMOrcm7DonRpw7Ruw6BsaXrDpnRpw7hu
puts "ñårƒ" + s0        # => ñårƒIñtërnâtiônàlizætiøn
puts "ñårƒ" + s1        # => Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT

I think it's clear how this will quickly lead to trouble when you're dealing with some YAML source containing nested hashes and arrays with leaf strings.

Currently I have some code that traverses all hashes and arrays and calls force_encoding on each string. That, to say the least, is unsightly.
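
For reference, that workaround looks roughly like the sketch below. The name force_utf8 is made up for illustration, and the code assumes every ASCII-8BIT string in the loaded structure really holds UTF-8 bytes.

```ruby
# Hypothetical helper: walk a loaded structure and retag binary strings as UTF-8.
# Assumption: any string tagged ASCII-8BIT actually contains UTF-8 bytes.
def force_utf8(obj)
  case obj
  when String
    # Only retag strings that came back as binary; dup so frozen strings are safe.
    if obj.encoding == Encoding::ASCII_8BIT
      obj.dup.force_encoding(Encoding::UTF_8)
    else
      obj
    end
  when Array
    obj.map { |v| force_utf8(v) }
  when Hash
    # Rebuild the hash, converting both keys and values.
    obj.each_with_object({}) { |(k, v), h| h[force_utf8(k)] = force_utf8(v) }
  else
    # Numbers, symbols, nil, etc. pass through untouched.
    obj
  end
end
```

It would be applied to the result of the load, e.g. force_utf8(YAML::load(y)). It returns a converted copy rather than mutating the loaded objects in place.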

What I'm looking for right now is a way to tell YAML::load that any string that comes in should be treated as, and therefore have its encoding set to, UTF-8.


Ideally, Ruby's YAML should just annotate the strings it dumps with the proper encoding. There's a Ya2YAML project that attempts to dump UTF-8 safe YAML. I'm not sure how far along it is. If anyone has played with it, I welcome any thoughts.

Regardless of that, I still have these dumps without any encoding information to deal with, although I know they are all UTF-8.

A: 

Firstly, the text file that you're attempting to read (this should be your YAML file) must be UTF-8 encoded.

Then add this line to the top of your Ruby file, hash and all:

# encoding: UTF-8

This sets the file's source encoding to UTF-8, so string literals 'like this' will be UTF-8 encoded, as should any literal text you dump with YAML.dump('text'), and all should work well from here on in.
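
A minimal sketch of what the magic comment affects, assuming it sits on the first line of the file (the variable name is illustrative only). It governs the source encoding of that one file, and with it the encoding of string literals written there:

```ruby
# encoding: UTF-8
# The magic comment must be the first line (or the second, after a shebang).

literal = "ñårƒ"               # a string literal written in this file
puts __ENCODING__              # the source encoding of this file
puts literal.encoding          # literals take the source encoding
puts literal.valid_encoding?   # whether the bytes are well-formed in that encoding
```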

Dublinclontarf
Doesn't matter. Stuff read as !binary from the YAML ends up as ASCII-8BIT, for maybe obvious and sane reasons, but then YAML should dump unescaped UTF-8 strings properly. I more or less have a solution to this, but it involves a good chunk of code. I'll post an answer when I have a gem ready.
kch