ansaurus

Question

Ruby's String#gsub, unicode, and non-word characters

Answer 1

+4 A:

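You need to run Ruby with the "-Ku" option to make it treat strings and regular expressions as UTF-8 (see the documentation for command-line options). Without it, Ruby 1.8 looks at the string one byte at a time, so \W also matches the bytes that make up a multibyte character such as "í". This is what happens when I run irb with the option set: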

% irb -Ku
irb(main):001:0> my_str = "Quística."
=> "Quística."
irb(main):002:0> processed = my_str.gsub(/\W/,'')
=> "Quística"
irb(main):003:0>

You can also put it on the #! line in your ruby script:

#!/usr/bin/ruby -Ku

wdebeaum 2009-10-26 23:08:46

Gah. I thought I already was in UTF-8 mode. That sorts things out, thanks for the help!

Steven Bedrick 2009-10-26 23:39:05

Answer 2

+1 A:

I would just like to add that in 1.9.1 it works by default.

$ irb
ruby-1.9.1-p243 > my_str = "Quística."
=> "Quística."
ruby-1.9.1-p243 > processed = my_str.gsub(/\W/,'')
=> "Quística"
ruby-1.9.1-p243 > processed.encoding
=> #<Encoding:UTF-8>
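
A note for anyone trying this on a newer Ruby: the transcript above is from 1.9.1. As far as I know, later releases (1.9.2 onward) treat \w and \W as ASCII-only again, so \W may strip the "í" there. A minimal sketch that does not depend on that behaviour is to name a Unicode-aware class explicitly. The variable names below are just for illustration.

```ruby
# encoding: utf-8
# Sketch: strip non-word characters without relying on how \W treats non-ASCII.
my_str = "Quística."

# [[:word:]] is a POSIX bracket expression; on UTF-8 strings it covers
# Unicode letters, marks, numbers and connector punctuation (such as "_").
processed = my_str.gsub(/[^[:word:]]/, '')

# \p{L} and \p{N} are Unicode property classes (letters and numbers).
# Unlike [[:word:]], this variant also strips underscores.
letters_and_digits = my_str.gsub(/[^\p{L}\p{N}]/, '')

puts processed
puts letters_and_digits
```

Both calls keep the accented letter and drop only the trailing period. Which class to pick depends on whether underscores should count as word characters.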

PS. Nothing beats rvm for trying out different versions of Ruby. DS.

Jonas Elfström 2009-10-27 05:44:48

Ooooh, that's certainly nice to see. I haven't gotten around to playing with 1.9 yet, but I'm glad to see that it addresses some of 1.8's character encoding quirks.

Steven Bedrick 2009-10-28 15:04:41

It doesn't just address some of them, it addresses all of them. And all of Java's, C++'s, Python's, PHP's, ..., too. Ruby 1.9's encoding system is probably the most powerful, most complete evar, with the possible exception of only ELisp. It also *looks* insanely complicated, but that is because encoding *is* complicated. Java's encoding may *look* simpler, but have you ever seen a moderately complex piece of Java that actually *uses* `String`? No, all parsers, decoders, compilers, Regexp engines, XML libraries actually use `byte[]`, exactly *because* `String` is too simplistic.

Jörg W Mittag 2009-10-29 13:29:04

Well, I'll definitely have to check it out soon, then. I swear, if I could trade, say, a kidney for never having to deal with another character encoding issue again for the rest of my life, I might actually consider the deal. I mean, forget all the truly big, complicated encoding issues; just considering the stupid little ones like the one I described in the original question, how many collective hours of our lives have we wasted dealing with this crap? I'll tell you: Way. Too. Many.

Steven Bedrick 2009-11-02 05:10:04
