I am getting this error when requesting /find/Wrocław: Encoding::UndefinedConversionError, "\xC5" from ASCII-8BIT to UTF-8

For some mysterious reason Sinatra is passing the string tagged as ASCII-8BIT (binary) instead of UTF-8, as it should be.

I have found an ugly workaround: calling string.force_encoding("UTF-8") on the value. I don't know why Rack assumes the encoding is ASCII-8BIT, and doing this for every param is tedious.
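To make the workaround less tedious, the re-tagging can be wrapped in one helper. This is only a sketch: `force_utf8` is a hypothetical name, not part of Sinatra or Rack, and it assumes the incoming bytes really are UTF-8.

```ruby
# Hypothetical helper (not a Sinatra or Rack API): returns a copy of `value`
# with every String re-tagged as UTF-8, recursing into Hashes and Arrays.
# Hash keys are left untouched.
def force_utf8(value)
  case value
  when String then value.dup.force_encoding(Encoding::UTF_8)
  when Hash   then value.each_with_object({}) { |(k, v), out| out[k] = force_utf8(v) }
  when Array  then value.map { |v| force_utf8(v) }
  else value
  end
end

# Simulate what the app receives: UTF-8 bytes tagged as ASCII-8BIT.
raw = { 'city' => "Wroc\xC5\x82aw".b, 'tags' => ["caf\xC3\xA9".b] }

begin
  raw['city'].encode('UTF-8') # transcoding binary data is what raises
rescue Encoding::UndefinedConversionError => e
  puts "encode fails: #{e.class}"
end

fixed = force_utf8(raw)
puts fixed['city']          # Wrocław
puts fixed['city'].encoding # UTF-8
```

In a Sinatra app something like this could be applied to the request parameters in a `before` filter. Note that `force_encoding` only changes the tag and never validates the bytes, so `valid_encoding?` is worth checking if the input is untrusted.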

A:

I was having some similar problems with routing to "/protégés/:id". I posted to the Rack mailing list, but the response wasn't great.

The solution I came up with isn't perfect, but it works for most cases. First, create a middleware that decodes the percent-encoded UTF-8 in the request path:

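# in lib/fix_unicode_urls_middleware.rb:
require 'cgi'

# Rack middleware that percent-decodes the request path before the app sees it.
class FixUnicodeUrlsMiddleware
  ENVIRONMENT_VARIABLES_TO_FIX = [
    'PATH_INFO', 'REQUEST_PATH', 'REQUEST_URI'
  ].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    ENVIRONMENT_VARIABLES_TO_FIX.each do |var|
      # Only touch values that contain a percent-escape (% plus two hex digits).
      # Caveat: CGI.unescape also turns "+" into a space.
      env[var] = CGI.unescape(env[var]) if env[var] =~ /%[0-9A-Fa-f]{2}/
    end
    @app.call(env)
  end
end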

Then use that middleware in your config/environment.rb (Rails 2.3) or config/application.rb (Rails 3).
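Since the question is about Sinatra rather than Rails, here is a rough sketch of what wiring in such a middleware amounts to. It repeats a compact copy of the class so it runs on its own, and it uses a plain lambda as a stand-in for the real application (an assumption for illustration; any object that responds to `call(env)` is a valid Rack app).

```ruby
require 'cgi'

# Compact copy of the middleware, repeated so this sketch is self-contained.
class FixUnicodeUrlsMiddleware
  KEYS = %w[PATH_INFO REQUEST_PATH REQUEST_URI].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    KEYS.each do |key|
      env[key] = CGI.unescape(env[key]) if env[key] =~ /%[0-9A-Fa-f]{2}/
    end
    @app.call(env)
  end
end

# Stand-in for the real app: echoes the path it was given.
app = ->(env) { [200, {}, [env['PATH_INFO']]] }

# Wrapping the app by hand is what `use FixUnicodeUrlsMiddleware` does for you.
stack = FixUnicodeUrlsMiddleware.new(app)
_status, _headers, body = stack.call({ 'PATH_INFO' => '/find/Wroc%C5%82aw' })

puts body.first          # /find/Wrocław
puts body.first.encoding # UTF-8
```

In a config.ru file the same wrapping is normally written as `use FixUnicodeUrlsMiddleware` placed before the `run` line.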

You'll also have to make sure the response declares the right charset in its Content-Type HTTP header:

Content-Type: text/html; charset=utf-8

You can set that in Rails, in Rack, or in your web server, depending on how many different encodings you use on your site.
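As one illustration of the Rack option, a bare Rack-style endpoint can return the header directly. This is a minimal sketch, not a recommendation for where to set it; the lambda and its body are made up for the example.

```ruby
# A Rack-style endpoint is anything whose call(env) returns
# [status, headers, body]; the charset goes into the Content-Type header.
app = lambda do |_env|
  body = '<p>Wrocław</p>'
  [200, { 'Content-Type' => 'text/html; charset=utf-8' }, [body]]
end

status, headers, _body = app.call({})
puts status                  # 200
puts headers['Content-Type'] # text/html; charset=utf-8
```

Newer Rack versions expect lowercase header names (`content-type`), so adjust the key to match the version in use. In Sinatra, the `content_type` helper accepts a charset option and should achieve the same thing.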

James A. Rosen
A: 

AFAIK you are not supposed to put raw UTF-8 characters in URLs; you must percent-encode them. Not doing so will likely cause all kinds of issues with, say, standards-compliant proxies. It looks like this is not so much a Rack issue as a problem with the application emitting invalid URLs. The charset and encoding information in the HTTP header applies to the content, not to the headers themselves.

To quote RFC 3986:

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
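The RFC's examples can be reproduced in Ruby when generating links. This sketch uses `ERB::Util.url_encode` from the standard library, which percent-encodes every byte outside the unreserved set; choosing that particular method is my assumption, and other encoders would work as well.

```ruby
require 'erb'

# Percent-encode the UTF-8 octets of each sample, leaving unreserved
# characters (letters, digits, "-", ".", "_", "~") as they are.
['A', 'À', 'ア', 'Wrocław'].each do |text|
  puts "#{text} -> #{ERB::Util.url_encode(text)}"
end
# A -> A
# À -> %C3%80
# ア -> %E3%82%A2
# Wrocław -> Wroc%C5%82aw
```

An application that builds its links this way would emit /find/Wroc%C5%82aw rather than the raw /find/Wrocław.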

Bruno Rohée