ansaurus

Question

How to strip out invalid UTF-8 characters in Ruby 1.9

Answer 1

A:

What is your locale set to in the shell? On Linux-based systems you can check it by running the locale command, and change it with, e.g.:

$ export LANG=en_US

My guess is that your locale settings specify a UTF-8 encoding, and this causes Ruby to assume that your text files were written as UTF-8. You can see the effect by trying:

$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
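If changing the locale is not an option, the same idea can be handled inside Ruby. The following is only a rough sketch, not something from the original setup: the sample bytes are made up, and ISO-8859-1 is an assumed source encoding that you would replace with whatever your files really use. It shows why bytes tagged as UTF-8 can be invalid, how re-tagging them with their real encoding fixes that, and, as a fallback, one common way to drop invalid bytes by round-tripping through UTF-16BE.

```ruby
# Hypothetical sample data: the bytes of "naive" with a Latin-1 i-diaeresis
# (0xEF) in the middle. Array#pack returns a binary (ASCII-8BIT) string.
raw = [0x6E, 0x61, 0xEF, 0x76, 0x65].pack("C*")

# Encoding Ruby assumes for external data; it is derived from the locale.
puts Encoding.default_external.name

# Tagged as UTF-8, the lone 0xEF byte is an invalid sequence.
as_utf8 = raw.dup.force_encoding("UTF-8")
puts as_utf8.valid_encoding?    # false

# Tagged with its real encoding (assumed here to be ISO-8859-1), the string
# is valid and can be transcoded to proper UTF-8 without losing anything.
as_latin1 = raw.dup.force_encoding("ISO-8859-1")
converted = as_latin1.encode("UTF-8")
puts converted.valid_encoding?  # true

# Fallback when the real encoding is unknown: drop the invalid bytes by
# transcoding to UTF-16BE and back, replacing invalid input with "".
stripped = as_utf8.encode("UTF-16BE", :invalid => :replace, :replace => "").encode("UTF-8")
puts stripped                   # nave
```

Re-tagging with the correct encoding keeps the character, whereas stripping silently loses it, so stripping is best kept as a last resort.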

For a more general treatment of how string encoding changed in Ruby 1.9, I thoroughly recommend http://blog.grayproductions.net/articles/ruby_19s_string

(The shell examples assume bash or a similar shell; C-shell derivatives use different syntax.)

telent 2010-09-10 11:12:22

Excellent. I guess this is my karmic punishment for slinging strings around in C without caring about the encoding, or for being a native English speaker.

Doches 2010-09-10 11:23:44

@Doches: So, *you're* the guy who writes all those apps that won't let me use my actual name. BTW: admitting you have a problem is the first ... bla bla bla :-)

Jörg W Mittag 2010-09-10 17:44:35
