I am trying to figure out a 'proper' way of sorting UTF-8 strings in Ruby on Rails.

In my application, I have a select box that is populated with countries. As my application is localized, each existing locale has a countries.yml file that relates a country's id to the localized name for that country. I can't sort the strings manually in the yml file because I need the ID to be consistent across all locales.

What I have done is create an ascii_name method which uses the unidecode gem to convert accented and non-Latin characters to their ASCII equivalents (for instance, "Afeganistão" would become "Afeganistao"), and then sort on that:

require 'unidecode'

class Country
  def ascii_name
    Unidecoder.decode(name).gsub("[?]", "").gsub(/`/, "'").strip
  end
end

Country.all.sort_by(&:ascii_name)

However, there are obvious issues with this:

  • It cannot properly sort non-Latin locales, as there may not be a directly analogous Latin character.
  • It makes no distinction between a letter and all accented forms of that letter (so, for instance, A and Ä become interchangeable).

Does anyone know of a better way that I could sort my strings?

A: 

Would it be possible to add an Order or ID attribute to your countries.yml? That way you can sort manually and still preserve a common identifier.

Jason Sperske
I suppose so, but that's not the solution I'm looking for, because it'd require a lot of extra manual work. Also, the people who maintain the translations are generally non-technical and thus may not understand what to do with an order attribute.
Daniel Vandersluis
+1  A: 

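There are a couple of ways to go. You may want to convert the UTF-8 strings to strings of character code point values (the snippet below emits them in decimal rather than hex) and then sort them:

# unpack('U*') returns the Unicode code point of every character as an integer
s.unpack('U*').join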

or you may use the library iconv. Read up on it and use it as appropriate (from dzone):

#add this to environment.rb
#call to_iso on any UTF8 string to get a ISO string back
#example : "Cédez le passage aux français".to_iso

class String
  require 'iconv' #this line is not needed in rails !
  def to_iso
    Iconv.conv('ISO-8859-1', 'utf-8', self)
  end
end
Ryan Oberoi
Hm, sorting by the hex value does seem to put my strings in the alphabetical order, but I don't really understand how it's working, can you explain that? Also, it's still sorting Á before A, which seems backwards to me.
Daniel Vandersluis
Also watch out: Unicode sorting depends on the locale! Different countries have a different order in their dictionary.
Rutger Nijlunsing
Well, converting to hex gives you an ordering that is better understood by sort functions. I would experiment a bit by using hex values formatted to 2 or 3 decimal places, or even by using decimal values for each character. I am not a big UTF user myself, but it appears from Rutger's comments that what you are trying to do does not have an exact answer.
Ryan Oberoi
@Rutger that's what I'm trying to figure out how to implement, I guess, and is another downfall of my current method (or sorting by character code)
Daniel Vandersluis
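On the question raised above of why Á lands before A: a plausible explanation is that a key built by joining decimal code points has no fixed width, so "193" (Á) compares as less than "65" (A) character by character. The sketch below contrasts that with zero-padded hex keys; the helper names concatenated_key and padded_hex_key are hypothetical and not from the thread. Note that the padded version is just plain code point order, so accented letters still sort after Z.

```ruby
# Hypothetical helper: join decimal code points without padding.
def concatenated_key(s)
  s.unpack('U*').join
end

# Hypothetical helper: pad every code point to six hex digits so that
# string comparison matches numeric code point comparison.
def padded_hex_key(s)
  s.unpack('U*').map { |cp| format('%06x', cp) }.join
end

words = ["\u00C1", "A", "Z"] # "\u00C1" is Á

# Keys are "193", "65", "90": "193" sorts first because '1' < '6'.
by_concat = words.sort_by { |w| concatenated_key(w) }

# Keys are "0000c1", "000041", "00005a": plain code point order.
by_padded = words.sort_by { |w| padded_hex_key(w) }

puts by_concat.inspect
puts by_padded.inspect
```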
A: 

Have you tried accessing the mb_chars method for each of your country strings? mb_chars is a proxy that ActiveSupport adds that defines Unicode-safe versions of all the String methods. If the comparator is Unicode-aware, then the sorting should work correctly.

John Topley
The problem with using mb_chars is the same as sorting straight; because in the character set A-Z comes before Ä, accented characters will not sort into the correct location.
Daniel Vandersluis
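The ordering problem described in that comment can be seen with plain Ruby strings (no ActiveSupport involved; this is only an illustration with made-up sample data). The default comparison is byte-wise, and every accented letter encodes to UTF-8 bytes that are larger than the byte for "Z":

```ruby
# Sample data is illustrative. "\u00C5" is Å and "\u00D6" is Ö,
# written as escapes so the code points are unambiguous.
names = ["Zimbabwe", "\u00C5land", "Albania", "\u00D6sterreich"]

# Array#sort uses String#<=>, which compares the strings byte by byte.
sorted = names.sort

# Accented initials end up after "Zimbabwe" instead of next to A and O.
puts sorted.inspect
```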
+3  A: 

http://github.com/grosser/sort_alphabetical/tree/master

Maybe this plugin will help.

İ. Emre Kutlu
Thanks, that was exactly the sort of plugin I was looking for!
Daniel Vandersluis
This plugin relies on NFD decomposition (http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms) and fails in some cases. Not all characters with diacritics can be decomposed this way (for example, the Polish letter Ł cannot).
skalee
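To make the NFD limitation concrete, here is a minimal sketch of a decomposition-based sort key. It is not the plugin's actual code: the helper name nfd_sort_key is hypothetical, and it assumes Ruby 2.2 or later for String#unicode_normalize. Á decomposes into A plus a combining accent that can be stripped, while Ł has no decomposition and therefore sorts after Z rather than between L and M as Polish expects.

```ruby
# Hypothetical helper: decompose to NFD, drop combining marks
# (Unicode category Mn), and downcase for a case-insensitive key.
def nfd_sort_key(s)
  s.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# Illustrative data. "\u0141" is Ł and "\u00C1" is Á.
words = ["Meksyk", "\u0141otwa", "Litwa", "\u00C1frica"]

sorted = words.sort_by { |w| nfd_sort_key(w) }

# Á is folded to "a" and sorts first; Ł is left untouched by NFD,
# so "Łotwa" ends up last instead of between "Litwa" and "Meksyk".
puts sorted.inspect
```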
A: 

What you are trying to do is a very messy proposition. There is no way to do transparent transliteration on all Unicode characters because the meaning of digraphs changes from locale to locale, and strings can grow HUGE (if, say, you replace 10 Chinese symbols with their phonetic equivalents). Don't go there.

Why do you want transliterated names in the first place? For URLs? Browsers handle Unicode URLs decently now, so you are inventing a huge problem out of thin air. If you need IDs, preprocess your lists to include a stable numeric ID per country and use that as an identifier. Or save the English name of the country as the identifier (you can download locale-aware ISO country lists for free).

If you truly want good transliteration for Unicode (and this is not what you want in this case), see the IBM ICU libraries; there is a dormant gem for them.

Julik
+1  A: 

The only working solution I have found so far (at least for Ruby 1.8, because Ruby 1.9 should handle Unicode better) is the Unicode library by Yoshida Masato. It provides a Unicode.strcmp method.

EDIT: Sorry, this solution uses NFD decomposition as well with all its limitations.

skalee