I have a hash named hsh whose values are UTF-8 encoded. For example:

hsh = {:name => some_utf_8_string, :text => some_other_utf_8_string}

I am currently doing the following:

$KCODE="UTF8"

File.open("save.tsv", "w") do |file|
  # Replace embedded tabs with spaces so they cannot split a field
  file.puts hsh.values.map { |x| x.to_s.gsub("\t", ' ') }.join("\t")
end

But this croaks randomly. I think some of the multibyte contents sort of match "\t", and then it fails. Is there a recommended string I can use instead of "\t", and is there a better way of doing the above?

Thanks

+2  A: 

If your data is valid UTF-8, there is no way for a tab character to "sort of" match part of a multibyte sequence (this is one of the advantages of UTF-8 over some other multibyte encodings). Can you go into more detail about what you mean by "croak"?

Logan Capaldo
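Since the answer depends on the data actually being valid UTF-8, here is a minimal sketch of one way to check that before writing the file. It assumes Ruby 1.9 or later, where strings carry an encoding and String#valid_encoding? exists (the $KCODE style in the question suggests 1.8, which lacks it). The helper name invalid_utf8_fields is hypothetical.

```ruby
# Hypothetical helper: returns the keys whose values are NOT valid UTF-8.
# Assumes Ruby 1.9+ (String#force_encoding and String#valid_encoding?).
def invalid_utf8_fields(hsh)
  hsh.select do |_key, value|
    # dup so that force_encoding does not mutate the caller's string
    !value.to_s.dup.force_encoding(Encoding::UTF_8).valid_encoding?
  end.keys
end

good = { :name => "Zo\u00EB", :text => "tab\there" }
bad  = { :name => "Zo\u00EB", :text => "\xC3" }  # lead byte with no follow-up byte

p invalid_utf8_fields(good)
p invalid_utf8_fields(bad)
```

If this reports any keys, the failure is more likely caused by malformed input than by tabs hiding inside multibyte characters.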
Logan's right. In UTF-8 there are three kinds of bytes: those covering 7-bit ASCII (0XXXXXXX), the first byte of a multi-byte character (110XXXXX, 1110XXXX, 11110XXX), and a follow-up byte of a multi-byte character (10XXXXXX). Tab (00001001 = 0x09) only matches itself, never any part of a multi-byte character.
rampion
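The byte classification described in this comment can be sketched in a few lines of Ruby. This is only an illustration, assuming valid UTF-8 input and Ruby 1.9 or later; utf8_byte_kind is a hypothetical helper name, not part of any library.

```ruby
# Hypothetical helper: classify one byte of a valid UTF-8 string by its high bits.
def utf8_byte_kind(byte)
  if (byte & 0x80) == 0x00      # 0XXXXXXX: 7-bit ASCII, including tab (0x09)
    :ascii
  elsif (byte & 0xC0) == 0x80   # 10XXXXXX: follow-up byte of a multi-byte character
    :continuation
  else                          # 110XXXXX, 1110XXXX, 11110XXX: first byte
    :lead
  end
end

sample = "na\u00EFve\t\u65E5\u672C"   # contains one real tab and three multi-byte characters
bytes  = sample.bytes.to_a

# Every byte belonging to a multi-byte character has its high bit set,
# so none of them can be equal to 0x09.
multibyte_bytes = bytes.reject { |b| utf8_byte_kind(b) == :ascii }
tab_count       = bytes.count(0x09)

p multibyte_bytes.min >= 0x80
p tab_count
```

Because lead and follow-up bytes are always 0x80 or higher, a byte-wise search for "\t" can only ever hit a real tab character.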