I am using Sphinx with the Thinking Sphinx plugin to search my data. I am using MySQL.

My data contains accented chars ("á", "é", "ã") and I want them to be equivalent to their non-accented counterparts ("a", "e", "a", for example) when searching and ordering.

I got the search working using a charset table (pastie.org/204316), and a search for "AGUA" returns "ÁGUA", but the ordering of the results is not working properly. In a search for "AGUA", "ÁGUA" comes after "MUITA ÁGUA", for example, but I want it to be sorted as if it were written with an "A", not an "Á".

The only solution I can think of is to index a new column containing the non-accented chars and use it for sorting, with the MySQL REPLACE function (http://dev.mysql.com/doc/refman/5.4/en/string-functions.html#function_replace) stripping the accented chars. But I would need one call to REPLACE for each possible accented char (and there are many), which does not seem like a very maintainable workaround.

Does anybody know a better way to handle this issue?

Thanks!

+3  A: 

Sphinx handles sorting on string fields by storing all the values in a list, sorting the list and then storing the index of each string as an int attribute. According to the docs the sorting of this list is done at a byte level and currently isn't configurable.

Ideally the strings should be sorted differently, depending on the encoding and locale. For instance, if the strings are known to be Russian text in KOI8R encoding, sorting the bytes 0xE0, 0xE1, and 0xE2 should produce 0xE1, 0xE2 and 0xE0, because in KOI8R value 0xE0 encodes a character that is (noticeably) after characters encoded by 0xE1 and 0xE2. Unfortunately, Sphinx does not support that at the moment and will simply sort the strings bytewise.

-- from http://www.sphinxsearch.com/docs/current.html
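To see why "ÁGUA" lands after "MUITA ÁGUA", here is a small plain-Ruby sketch of the ordinal scheme described above. It is only an illustration, not Sphinx's actual code: the sample titles are made up, and Ruby's default string comparison stands in for the bytewise sort the docs mention.

```ruby
# Hypothetical sample titles (UTF-8); not taken from a real index.
titles = ["MUITA ÁGUA", "ÁGUA", "AGUA", "BOLA"]

# Ruby compares strings byte by byte, which approximates the bytewise
# sort described in the Sphinx docs.
bytewise = titles.sort

# Each string's position in the sorted list plays the role of the
# integer sort attribute.
ordinals = bytewise.each_with_index.to_h

puts bytewise.inspect
puts ordinals["ÁGUA"] > ordinals["MUITA ÁGUA"]
```

In UTF-8, "Á" is encoded as the bytes 0xC3 0x81, which compare higher than any ASCII letter, so every title starting with an accented letter sorts after all the unaccented ones.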

So, no easy way to achieve this within Sphinx. A modification to your REPLACE() based idea would be to have a separate column and populate it using a callback in your model. This would let you handle the replace in Ruby instead of MySQL, an arguably more maintainable solution.

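# Save an unaccented copy of your title in a separate column (sort_col)
# and sort on that. Normalise method borrowed from
# http://stackoverflow.com/questions/522715/removing-accents-diacritics-from-string-while-preserving-other-special-chars-tri
class MyModel < ActiveRecord::Base
  before_validation :update_sort_col

  private

  def update_sort_col
    # Decompose accented chars (:kd), then drop everything outside ASCII.
    # `self.` is required: a bare `sort_col = ...` only sets a local variable.
    self.sort_col = self.title.to_s.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n, '').to_s
  end
end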
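
The same idea can be tried outside Rails to check what the sort column would contain. This is a minimal sketch, assuming Ruby 2.2 or later, where String#unicode_normalize is available in core; the helper name unaccent and the sample titles are hypothetical.

```ruby
# Hypothetical helper: decompose accented characters (NFKD), then drop
# every non-ASCII character, which removes the combining accent marks.
def unaccent(str)
  str.to_s.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, "")
end

# Made-up sample titles (UTF-8).
titles = ["MUITA ÁGUA", "ÁGUA", "BOLA"]

# Sort by the unaccented key, as the indexed sort column would.
sorted = titles.sort_by { |t| unaccent(t) }

puts unaccent("ÁGUA")
puts sorted.inspect
```

One caveat: characters that have no ASCII decomposition (for example "ø" or "ß") are dropped entirely by this approach, so it may need an explicit mapping for those if they occur in your data.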
James Healy