I've isolated a problem with Ruby on Rails where a model with a serialized column is not properly loading data that has been saved to it.

What goes in is a Hash, and what comes out is a YAML string that can't be parsed due to formatting issues. I'd expect that a serializer can properly store and retrieve anything you give it, so something appears to have gone wrong.

The troublesome string in question is formatted something like this:

message_text = <<END

  X
X
END

yaml = message_text.to_yaml

puts yaml
# =>
# --- |
#
#   X
# X

puts YAML.load(yaml)
# => ArgumentError: syntax error on line 3, col 0: ‘X’

The combination of a leading blank line, an indented second line, and a non-indented third line causes the parser to fail. Removing either the blank line or the indentation appears to remedy the problem, so this does look like a bug in the serialization process. Since it requires a rather specific set of circumstances, I'm willing to bet this is an edge case that isn't properly handled.
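
A quick way to check whether a particular string survives serialization is a small round-trip helper. This is only a sketch: yaml_round_trips? is a hypothetical name, not part of the YAML library, and the result for the problem string depends on which YAML backend your Ruby uses (a Syck-based one is expected to fail as described above, others may not).

```ruby
require 'yaml'

# Hypothetical helper: true when a value survives a YAML dump/load cycle
# unchanged, false when the parser raises or the loaded value differs.
def yaml_round_trips?(value)
  YAML.load(YAML.dump(value)) == value
rescue StandardError
  false
end

message_text = "\n  X\nX\n"   # same content as the heredoc above

puts yaml_round_trips?(message_text)
puts yaml_round_trips?("plain text")
```

Running this over a sample of real column values is a cheap way to find out how many stored records are affected.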

The YAML module that ships with Ruby and is used by Rails looks to delegate a large portion of the processing to Syck, yet does provide Syck with some hints as to how to encode the data it is sending.

In yaml/rubytypes.rb there's the String#to_yaml definition:

class String
  def to_yaml( opts = {} )
    YAML::quick_emit( is_complex_yaml? ? self : nil, opts ) do |out|
      if is_binary_data?
        out.scalar( "tag:yaml.org,2002:binary", [self].pack("m"), :literal )
      elsif to_yaml_properties.empty?
        out.scalar( taguri, self, self =~ /^:/ ? :quote2 : to_yaml_style )
      else
        out.map( taguri, to_yaml_style ) do |map|
          map.add( 'str', "#{self}" )
          to_yaml_properties.each do |m|
            map.add( m, instance_variable_get( m ) )
          end
        end
      end
    end
  end
end

There appears to be a check there for strings that start with ':' and could be confused with a Symbol when deserializing; the :quote2 option should tell the emitter to quote the string during encoding. Adjusting this regular expression to catch the conditions described above does not appear to have any effect on the output, so I'm hoping someone more familiar with the YAML implementation can advise.

+4  A: 

Yep, that looks like a bug in the C Syck library. I checked it using the PHP Syck bindings (v 0.9.3): http://pecl.php.net/package/syck and the same bug is present, which indicates the bug is in the library itself rather than in the Ruby YAML library or the Ruby Syck bindings:

// phptestsyck.php
<?php
$message_text = "

  X
X
";

syck_load(syck_dump($message_text));
?>

Running this on the cli gives the same SyckException:

$ php phptestsyck.php 
PHP Fatal error:  Uncaught exception 'SyckException' with message 'syntax error on line 5, col 0: 'X'' in /.../phptestsyck.php:8
Stack trace:
#0 /.../phptestsyck.php(8): syck_load('--- %YAML:1.0 >...')
#1 {main}
  thrown in /.../phptestsyck.php on line 8

So, I suppose you could try to fix Syck itself. It appears that the library hasn't been updated since v0.55 in May of 2005 (http://rubyforge.org/projects/syck/), though.

Alternatively, there is a pure-Ruby YAML parser called RbYAML (http://rbyaml.rubyforge.org/), which originated with JRuby and doesn't appear to have this bug:

>> require 'rbyaml'
=> true
>> message_text = <<END

  X
X
END
=> "\n  X\nX\n"
>> yaml = RbYAML.dump(message_text)
=> "--- \"\\n  X\\nX\\n\"\n"
>> RbYAML.load(yaml)
=> "\n  X\nX\n"
>>

Finally, have you considered another serialization format altogether? Ruby's Marshal library doesn't have this bug either and is faster than YAML (see http://significantbits.wordpress.com/2008/01/29/yaml-vs-marshal-performance/):

>> message_text = <<END

  X
X
END
=> "\n  X\nX\n"
>> marshal = Marshal.dump(message_text)
=> "\004\b\"\f\n  X\nX\n"
>> Marshal.load(marshal)
=> "\n  X\nX\n"
bantic
Thanks for the insight. I'm still not sure who to contact to resolve this issue, but there is a Github project (http://github.com/indeyets/syck/) which is carrying on from the old _why repository. I came across this issue when serializing HAML HTML output to an ActiveRecord column. HAML, coincidentally, formats HTML with the same two-space structure as YAML and triggers this edge case.
tadman
Did you consider using another serialization scheme like Marshal? Does it have to be yaml?
bantic
What I need is something that's able to fix the behavior of ActiveRecord serialized columns. If there's a way to use Marshal instead, I'm all for it, but I haven't seen any information as to how to do this without some serious monkeypatching.
tadman
+1  A: 

You have to give up the convenient ActiveRecord::Base serialize method to do so, but otherwise it's not hard to use your own serialization scheme. For example, to serialize a field called 'person_data':

class Person < ActiveRecord::Base
  def person_data
    self[:person_data] ? Marshal.load(self[:person_data]) : nil
  end

  def person_data=(x)
    self[:person_data] = Marshal.dump(x)
  end
end

# Use Person#person_data as normal; it is transparently marshalled
p = Person.find 1
p.person_data = {:color => "blue", :food => "vegetarian"}

(See this ruby forum thread for more)
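
One thing to watch: Marshal.dump returns binary data, and if person_data is a text column rather than a binary one, the raw bytes may not be stored faithfully, depending on the database and its encoding. A minimal sketch of one workaround is to Base64-encode the marshalled bytes, using the same pack("m") mechanism that String#to_yaml uses for binary data. MarshalColumn is a hypothetical helper name, not something Rails provides.

```ruby
# Hypothetical helper (not part of Rails): wraps Marshal output in Base64
# so the stored value is plain ASCII and safe for a text column.
module MarshalColumn
  # Serialize any marshallable object to a single-line Base64 string.
  def self.dump(value)
    [Marshal.dump(value)].pack('m0')
  end

  # Reverse of dump; returns nil for an empty or missing column value.
  def self.load(raw)
    return nil if raw.nil? || raw.empty?
    Marshal.load(raw.unpack('m0').first)
  end
end

encoded = MarshalColumn.dump("\n  X\nX\n")
puts encoded
puts MarshalColumn.load(encoded).inspect
```

In the accessors above you would call MarshalColumn.load(self[:person_data]) and MarshalColumn.dump(x) in place of the bare Marshal calls.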

bantic
Looks like that could benefit from some caching, but could work. Thanks!
tadman
Ah, well, that only works if what you write to the field is effectively frozen. For example, if you assign a hash and modify it later, only its state at the time of assignment is preserved.
tadman
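
The pitfall with later modifications can be reproduced without a database. In this sketch FakePerson is a hypothetical stand-in for the ActiveRecord model, with an in-memory hash playing the role of the row; the accessors mirror the Marshal-based ones from the answer.

```ruby
# Hypothetical stand-in for the model: @attributes plays the role of the row.
class FakePerson
  def initialize
    @attributes = {}
  end

  # Every read unmarshals a fresh copy of the stored value.
  def person_data
    raw = @attributes[:person_data]
    raw ? Marshal.load(raw) : nil
  end

  # Every write stores a snapshot of the value as it is right now.
  def person_data=(value)
    @attributes[:person_data] = Marshal.dump(value)
  end
end

person = FakePerson.new
data = { :color => "blue" }
person.person_data = data

data[:food] = "vegetarian"           # changes the original hash only
person.person_data[:color] = "red"   # changes a throwaway copy

p person.person_data                 # still the snapshot taken at assignment

# Reassigning through the setter is what actually stores the new state.
person.person_data = data.merge(:color => "red")
p person.person_data
```

So with this approach, any change has to go back through person_data= before saving. Caching the unmarshalled object in an instance variable would make in-place edits visible on later reads, but they would still need to be written back before the record is saved.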