I have an ActiveRecord model, Foo, which has a name field. I'd like users to be able to search by name, but I'd like the search to ignore case and any accents. Thus, I'm also storing a canonical_name field against which to search:

class Foo < ActiveRecord::Base
  validates_presence_of :name

  before_validation :set_canonical_name

  private

  def set_canonical_name
    self.canonical_name ||= canonicalize(self.name) if self.name
  end

  def canonicalize(x)
    x.downcase.  # something here
  end
end

I need to fill in the "something here" to replace the accented characters. Is there anything better than

x.downcase.gsub(/[àáâãäå]/,'a').gsub(/æ/,'ae').gsub(/ç/, 'c').gsub(/[èéêë]/,'e')....

And, for that matter, since I'm not on Ruby 1.9, I can't put those Unicode literals in my code. The actual regular expressions will look much uglier.

+1  A: 

You probably want Unicode decomposition ("NFD"). After decomposing the string, just filter out anything not in [A-Za-z]. ã will decompose to "a~" (approximately: the diacritic becomes a separate combining character), so the filtering leaves a reasonable approximation. Note that æ has no Unicode decomposition, so it needs an explicit mapping to "ae".
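
As a rough sketch of the decompose-then-filter idea: this assumes a Ruby new enough to have String#unicode_normalize (2.2 or later, so not the 1.8 setup from the question), and the helper name decompose_and_filter is made up for illustration.

```ruby
# Sketch only: requires Ruby 2.2+ for String#unicode_normalize.
# decompose_and_filter is a hypothetical helper name.
def decompose_and_filter(text)
  # NFD splits "é" into "e" followed by a combining acute accent (U+0301).
  decomposed = text.unicode_normalize(:nfd)
  # Keep only plain ASCII letters; the combining marks are dropped here.
  decomposed.gsub(/[^A-Za-z]/, '')
end

puts decompose_and_filter("résumé") # prints "resume"
```

Note that a strict [A-Za-z] filter also drops spaces and digits, so a real search field would probably want a wider whitelist.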

MSalters
+1  A: 

iconv:

http://groups.google.com/group/ruby-talk-google/browse_frm/thread/8064dcac15d688ce?

=============

A Perl module which I can't understand:

http://www.ahinea.com/en/tech/accented-translate.html

============

Brute force (there are a lot of those critters!):

http://projects.jkraemer.net/acts_as_ferret/wiki#UTF-8support

http://snippets.dzone.com/posts/show/2384

Gene T
+1 for the iconv thread.
obvio171
+3  A: 

Convert the text to normalization form D, remove all code points in the Unicode category Nonspacing Mark (Mn), and convert the result back to normalization form C. This strips all diacritics, and your problem is reduced to a case-insensitive search.
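
The three steps above could look roughly like this in Ruby. It is a sketch under the assumption of Ruby 2.2+ (for String#unicode_normalize); strip_diacritics is an illustrative name, not an existing library method.

```ruby
# Sketch only: requires Ruby 2.2+ for String#unicode_normalize.
# strip_diacritics is a hypothetical helper name.
def strip_diacritics(text)
  text.unicode_normalize(:nfd)  # 1. decompose: "å" becomes "a" + U+030A
      .gsub(/\p{Mn}/, '')       # 2. drop every nonspacing mark (category Mn)
      .unicode_normalize(:nfc)  # 3. recompose whatever is left
end

puts strip_diacritics("àáâãäå")               # prints "aaaaaa"
puts strip_diacritics("Visual Café").downcase # prints "visual cafe"
```

Unlike an ASCII-only filter, this keeps spaces, digits and non-Latin letters; only the combining marks are removed.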

See http://blogs.msdn.com/michkap/archive/2005/02/19/376617.aspx and http://blogs.msdn.com/michkap/archive/2007/05/14/2629747.aspx for details.

CesarB
Related answer: http://stackoverflow.com/questions/285228/how-to-convert-utf-8-to-us-ascii-in-java#285791
CesarB
+4  A: 

I think you may not really want to go down that path. If you are developing for a market that uses these kinds of letters, your users will probably think you are a sort of ...pip, because 'å' isn't even close to 'a' in any sense meaningful to a user. Take a different road and read up on searching in a non-ASCII way. This is exactly the kind of case for which someone invented Unicode and collation.

Jonke
+1 for suggestion to use appropriate database collation.
Constantin
I'm all for database collation, but someone might switch databases a year after I leave; I'd prefer to be defensive and at least do it in code, and possibly also in the DB. As for forcing the users to type what they mean: how many English users type résumé? Or "visual café"?
James A. Rosen
In the strip, the poster has the letters å and ä. If you reduce those to a, the words they appear in become meaningless. You can't strip those and use what is left. If you really work for a European market, you had better learn to search properly instead of trashing the users' data.
Jonke
In the Slovak language, for example, á and ä are very close to a. And so are all accented characters to those without accents. Lots of people don't use these at all in IM, etc.
Vojto
@Vojto: In most northern European languages, accented characters are far from their unaccented versions. In fact, they are symbols for very different sounds. Take the German öl, for example (http://en.bab.la/dictionary/german-english/oel), or the Swedish words ål (eel) and al (a tree).
Jonke
Cool, I just wanted to note that that's not necessarily true for all European languages. I mentioned Slovak, but it's the same for Czech, Polish, Croatian I guess, and pretty much all Slavic languages. And it's very important that search engines, etc. support searching by unaccented characters, because in most cases people are just too lazy to type accents.
Vojto
+6  A: 

Rails already has a built-in for normalizing. You just have to use it to normalize your string to form KD and then remove the remaining non-ASCII characters, like this:

>> "àáâãäå".chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
=> "aaaaaa"
unexist
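
For readers outside Rails, a similar KD-then-strip approach can be sketched with plain Ruby. This assumes Ruby 2.2+ (String#unicode_normalize) and is not the Rails API used in the answer above; ascii_fold is an illustrative name.

```ruby
# Sketch only: plain-Ruby analogue of the Rails one-liner above.
# Requires Ruby 2.2+; ascii_fold is a hypothetical helper name.
def ascii_fold(text)
  text.unicode_normalize(:nfkd)    # compatibility decomposition (form KD)
      .gsub(/[^\x00-\x7F]/, '')    # drop everything outside 7-bit ASCII
      .downcase
end

puts ascii_fold("àáâãäå")      # prints "aaaaaa"
puts ascii_fold("Visual Café") # prints "visual cafe"
```

Characters with no decomposition at all (æ, ø, ß and so on) are simply removed by the ASCII filter, so they would need an explicit mapping if they matter for the search.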
This might work in Ruby 1.9, but not in 1.8.
James A. Rosen
% ruby -v
ruby 1.8.7 (2008-08-11 patchlevel 72) [i686-linux]
unexist
Thanks, didn't know that functionality existed in Rails. The method name was different in my Rails version: "àáâãäå".mb_chars.
d__
+1 for form KD, which will also turn ligatures like 'ﬃ' to 'ffi'.
Christian Campbell
I'm trying to use this in another script outside a Rails app. I thought it'd be in `activesupport`, but after requiring it I still get a `NoMethodError` for `normalize`. Do you know what I have to require?
obvio171
It is in activesupport, but you will have to do it like this: ActiveSupport::Multibyte::Chars.new("àáâãäå").mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
unexist
This works great, but I had to do `mb_chars` like d__. `foo.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').to_s.split`
Sam Soffes
One more tip: if you get "NoMethodError: undefined method `normalize'", you may also need to explicitly set $KCODE = 'u' to force the default encoding for strings into Unicode.
jpatokal
+2  A: 

The key is to use two columns: canonical_text and original_text. Use original_text for display and canonical_text for searches. That way, if a user searches for "Visual Cafe," she sees the "Visual Café" result. If she really wants a different item called "Visual Cafe," it can be saved separately.

To get the characters in a Ruby 1.8 source file, do something like this:

register_replacement([0x0160].pack('U'), 'S')  # U+0160 is Š; U+008A is a control character
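
One way the pieces might fit together is sketched below. The REPLACEMENTS table, this register_replacement body and apply_replacements are all hypothetical illustrations, not code from the original project; the code points are built with pack('U') so the source file stays pure ASCII.

```ruby
# Sketch only: a hand-maintained replacement table.
# REPLACEMENTS, register_replacement and apply_replacements are hypothetical.
REPLACEMENTS = {}

def register_replacement(char, replacement)
  REPLACEMENTS[char] = replacement
end

# [codepoint].pack('U') builds the UTF-8 string for that code point.
register_replacement([0x00E6].pack('U'), 'ae') # U+00E6, ae ligature
register_replacement([0x00DF].pack('U'), 'ss') # U+00DF, sharp s
register_replacement([0x0160].pack('U'), 'S')  # U+0160, S with caron

def apply_replacements(text)
  # Apply each registered substitution in turn.
  REPLACEMENTS.inject(text) do |result, (char, replacement)|
    result.gsub(char, replacement)
  end
end

puts apply_replacements([0x0160].pack('U') + "koda") # prints "Skoda"
```

A table like this covers letters that Unicode decomposition leaves alone, at the cost of having to maintain the list by hand.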
James A. Rosen
Perhaps a nit, but the name 'canonical_text' would throw me a bit as what we're doing is lossy. I'd expect a name more like 'compatible_text' or 'decomposed_text' (although I can see the same argument against these, too). Perhaps just 'search_text'?
Christian Campbell
A: 

lol.. I just tried this.. and it is working.. I'm still not quite sure why.. but when I use these 4 lines of code:

  • str = str.gsub(/[^a-zA-Z0-9 ]/,"")
  • str = str.gsub(/[ ]+/," ")
  • str = str.gsub(/ /,"-")
  • str = str.downcase

it automatically removes any accents from filenames, which is what I was trying to do (strip the accents from filenames and then rename them). Hope it helps :)

It also removes all characters that aren't alphanumeric. Which is probably not the correct behavior, even for a filename.
Chuck
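
To make that concrete, here are the four lines wrapped in a method (slugify is a made-up name for illustration), run on two sample inputs. The accented letter is deleted outright, not replaced by its base letter.

```ruby
# Sketch only: the four gsub/downcase steps from the answer above,
# wrapped in a hypothetical helper called slugify.
def slugify(str)
  str = str.gsub(/[^a-zA-Z0-9 ]/, "") # delete anything that is not a-z, A-Z, 0-9 or a space
  str = str.gsub(/[ ]+/, " ")         # collapse runs of spaces
  str = str.gsub(/ /, "-")            # spaces become hyphens
  str.downcase
end

puts slugify("Visual Café") # prints "visual-caf"
puts slugify("my file.txt") # prints "my-filetxt"
```

The "é" disappears completely and the dot before the file extension is lost as well, which is the behavior the comment above warns about.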
A: 

For anyone reading this who wants to strip all non-ASCII characters, this might be useful. I used the first example successfully.

Kris