Ruby will not play nice with UTF-8 strings. I am passing data in an XML file, and although the XML document is declared as UTF-8, Ruby treats each multi-byte character (two bytes per character in my case) as separate single-byte characters.

I have started encoding the input strings in the '\uXXXX' format, but I cannot figure out how to convert this to an actual UTF-8 character. I have been searching all over this site and Google to no avail, and my frustration is pretty high right now. I am using Ruby 1.8.6.

Basically, I want to convert the string '\u03a3' -> "Σ".

What I had is:

data.gsub /\\u([a-zA-Z0-9]{4})/,  $1.hex.to_i.chr

Which of course gives a "931 out of char range" error.

Thank you, Tim

+1  A: 

Does something break because Ruby strings treat UTF-8 encoded code points as two characters? If not, then you should not worry too much about it. If something does break, then please add a comment to let us know. It is probably better to solve that problem than to look for a workaround.

If you need to do conversions, look at the Iconv library.

In any case, Σ (written literally, or as the XML character reference &#x3a3;) could be a better alternative to \u03a3. \uXXXX is used in JSON, but not in XML. If you want to parse the \uXXXX format, look at how a JSON library does it.
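
For illustration, here is a minimal sketch of parsing the \uXXXX format by hand. It assumes that Array#pack with the 'U' directive (a core Ruby method that turns a code point into its UTF-8 bytes) is acceptable for your case; it should also be available on 1.8.6, though I have not verified that here. The helper name unescape_unicode is made up for this example, and the sketch does not handle surrogate pairs for characters outside the Basic Multilingual Plane.

```ruby
# Hypothetical helper (not from any library): replaces each literal
# \uXXXX escape in the text with the UTF-8 encoding of that code point.
def unescape_unicode(text)
  # The block form of gsub is evaluated once per match, so $1 refers
  # to the four hex digits of the escape currently being replaced.
  text.gsub(/\\u([0-9a-fA-F]{4})/) do
    [$1.hex].pack('U')
  end
end

puts unescape_unicode('\u03a3')  # prints Σ
```

The block matters here: when the replacement is passed as a plain second argument, it is evaluated before gsub runs, so $1 does not yet refer to the current match.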

hrnt
+1  A: 

Ruby (at least 1.8.6) doesn't have full Unicode support. Integer#chr only accepts values from 0 to 255 and returns a single-byte string; values above the ASCII range are displayed as octal escapes (such as "\377").

To demonstrate:

irb(main):001:0> 255.chr
=> "\377"
irb(main):002:0> 256.chr
RangeError: 256 out of char range
        from (irb):2:in `chr'
        from (irb):2

You might try upgrading to Ruby 1.9. The chr docs don't explicitly state ASCII, so support may have expanded -- though the examples stop at 255.
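
For what it's worth, in Ruby 1.9 and later Integer#chr accepts an optional encoding argument. A small sketch, which will not run on 1.8.6 because chr takes no argument there:

```ruby
# Requires Ruby 1.9 or later.
sigma = 0x03a3.chr(Encoding::UTF_8)

puts sigma           # prints Σ
puts sigma.length    # prints 1 (one character)
puts sigma.bytesize  # prints 2 (two bytes in UTF-8)
```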

Or, you might try investigating ruby-unicode. I've never tried it myself, so I don't know how much it'll help.

Otherwise, I don't think Integer#chr alone can do quite what you want in Ruby 1.8.6.

Jonathan Lonowski