It seems to me that the YAML library that ships with ruby 1.9 is encoding-deaf.
What this means is that when generating YAML, it'll take any string of bytes, and escape any byte sequence that doesn't output clean ASCII. That's lame, but acceptable.
My problem is the other way around. When loading content from said YAML dump.
In the example that follows I create a UTF-8 string, dump it, it's dumped with the type !binary
. When I load it back, it has the encoding ASCII-8BIT. In the end of the example I try to concatenate both the original and the reloaded string with another UTF-8 string. The latter will fail with an Encoding::CompatibilityError
.
require 'yaml'
s0 = "Iñtërnâtiônàlizætiøn"
y = s0.to_yaml
s1 = YAML::load y
puts s0 # => Iñtërnâtiônàlizætiøn
puts s0.encoding # => UTF-8
puts s1 # => Iñtërnâtiônàlizætiøn
puts s1.encoding # => ASCII-8BIT
puts y # => --- !binary |
# ScOxdMOrcm7DonRpw7Ruw6BsaXrDpnRpw7hu
puts "ñårƒ" + s0 # => ñårƒIñtërnâtiônàlizætiøn
puts "ñårƒ" + s1 # => Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
I think it's clear how this will quickly lead to trouble when you're dealing with some YAML source containing nested hashes and arrays with leaf strings.
Currently I have some code that traverses all hashes and arrays and calls force_encoding
on each string. That, to say the least, is unsightly.
What I'm looking for right now is a way to tell YAML::load
that any string that comes in should be treated as, and therefore have its encoding set to UTF-8.
Ideally, ruby's YAML should just annotate the strings it dumps with the proper encoding. There's a Ya2YAML project that attempts to dump UTF-8 safe YAML. I'm not sure how far along it is. If anyone has played with it, I welcome any thoughts.
Regardless of that, I still have these dumps without any encoding information to deal with. Although I know they are all UTF-8.