ansaurus

Question

Java. Ignore accents when comparing strings

Answer 1

A:

The problem with this sort of conversion is that there isn't always a clear-cut mapping from accented to non-accented characters. It depends on code pages, localization, etc. For example, is an "a" with an accent equivalent to a plain "a"? Not a problem for a human, but trickier for a computer.

AFAIK Java does not have a built-in conversion that can look up the current localization options and make this sort of conversion. You may need an external library that handles Unicode better, like ICU (http://site.icu-project.org/).

Uri 2010-03-03 16:57:41

Answer 2

+8 A:

You didn't hear this from me (because I disagree with the premise of the question), but, you can use java.text.Normalizer, and normalize with NFD: this splits off the accent from the letter it's attached to. You can then filter off the accent characters and compare.
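A minimal sketch of that approach, for illustration only. The class and method names (AccentStripper, stripAccents, equalsIgnoringAccents) are made up for this example; java.text.Normalizer and the \p{M} regex class (Unicode combining marks) are standard Java. Non-ASCII letters are written as \u escapes so the source compiles regardless of file encoding.

```java
import java.text.Normalizer;

class AccentStripper {

    // Decompose each accented letter into base letter + combining mark (NFD),
    // then delete the combining marks (Unicode category M).
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Hypothetical helper: compare two strings after stripping accents.
    static boolean equalsIgnoringAccents(String a, String b) {
        return stripAccents(a).equals(stripAccents(b));
    }

    public static void main(String[] args) {
        String accented = "Jo\u00e3o"; // "Joao" with a tilde on the "a"
        System.out.println(stripAccents(accented));
        System.out.println(equalsIgnoringAccents(accented, "Joao"));
    }
}
```

Note that this only handles characters that have a canonical decomposition. Letters such as "ø" do not decompose into a base letter plus a mark, so they pass through unchanged.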

Chris Jester-Young 2010-03-03 16:58:27

Thank you, this is just what I needed.

framara 2010-03-03 17:01:57

Answer 3

+1 A:

In Spanish, n and ñ are considered different letters; ñ sorts between n and o. There's even a separate keyboard key for it.

As far as I know, in German, "ö" should be considered equal to "oe", not "o".

How are you going to handle all that? :)

Nicolás 2010-03-03 17:04:53

I just have a list of contacts, and when the user does a search I just need to display a sublist of matches or possible matches. It's not a big deal.

framara 2010-03-03 17:11:10

Answer 4

+5 A:

I think you should be using the Collator class. It allows you to set a strength and locale and it will compare characters appropriately.

From the Java 1.6 API:

You can set a Collator's strength property to determine the level of difference considered significant in comparisons. Four strengths are provided: PRIMARY, SECONDARY, TERTIARY, and IDENTICAL. The exact assignment of strengths to language features is locale dependant. For example, in Czech, "e" and "f" are considered primary differences, while "e" and "ě" are secondary differences, "e" and "E" are tertiary differences and "e" and "e" are identical.

I think the important point here (which people are trying to make) is that "Joao" and "João" should never be considered equal, but if you are sorting you don't want them compared by their raw character values, because then you would get something like Joao, John, João, which is not good. Using the Collator class definitely handles this correctly.
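A rough sketch of how this could look, under some assumptions: the class and method names (CollatorDemo, accentInsensitive, sortedCopy) are hypothetical, and Locale.ENGLISH is just a stand-in for whatever locale your users actually need. Collator.getInstance, setStrength and Collator.PRIMARY are standard java.text API.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollatorDemo {

    // PRIMARY strength ignores secondary (accent) and tertiary (case) differences,
    // as far as the chosen locale's rules define them that way.
    static Collator accentInsensitive(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    // Sort with the locale's default collation instead of raw character values.
    // Collator implements Comparator, so it can be passed straight to Arrays.sort.
    static String[] sortedCopy(String[] names, Locale locale) {
        String[] copy = names.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String accented = "Jo\u00e3o"; // "Joao" with a tilde on the "a"

        // compare(...) returns 0 when the strings are equal at the given strength
        System.out.println(accentInsensitive(Locale.ENGLISH).compare("Joao", accented) == 0);

        String[] names = {"John", accented, "Joao"};
        System.out.println(Arrays.toString(sortedCopy(names, Locale.ENGLISH)));
    }
}
```

With the default strength the accent still counts as a difference, so the two spellings stay distinct but sort next to each other, ahead of John. PRIMARY strength is the setting to reach for when a search should treat them as a match.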

DaveJohnston 2010-03-03 17:06:56

This is a better answer than the accepted one.

Software Monkey 2010-03-03 17:59:19

@Software Monkey: I agree too, even though I wrote the accepted answer. :-P

Chris Jester-Young 2010-03-03 19:39:37
