I have read the following from Collator's Javadoc.

"The exact assignment of strengths to language features is locale dependant. For example, in Czech, "e" and "f" are considered primary differences, while "e" and "ê" are secondary differences, "e" and "E" are tertiary differences and "e" and "e" are identical."

Does this mean that I should set the strength based on the language I am using? If so, can someone suggest defaults for these locales: us_en, us_es, ca_fr, spain_spanish, chile_spanish, portuguese?

A:

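It really depends on what you're trying to do rather than on the language: Collator.getInstance(locale) already applies that locale's rules, and the strength only controls which levels of difference count as significant. The following is true for most (all?) languages that use the Latin alphabet: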

  • Primary
    • Different: a, á, Á, b
    • Same: á, â
    • Same: a, A
  • Secondary
    • Different: a, á, Á, b
    • Different: á, â
    • Same: a, A
  • Tertiary
    • Different: a, á, Á, b
    • Different: á, â
    • Different: a, A
  • Identical
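    • Also distinguishes differences you can't see, for example between a precomposed accented letter (Á as the single code point U+00C1) and the base letter followed by a combining accent (A + U+0301)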
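
If you want to check these groupings on your own JVM, here is a minimal sketch that prints whether a few pairs compare as equal at each strength. The class name CollatorStrengthDemo and the helper sameAt are made up for illustration; only Collator.getInstance, setStrength and compare are real API. I have left out expected output on purpose, since the accent and IDENTICAL results can depend on the locale data and decomposition mode in use.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorStrengthDemo {

    /** Hypothetical helper: true if s1 and s2 compare as equal at the given strength. */
    static boolean sameAt(Locale locale, int strength, String s1, String s2) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return collator.compare(s1, s2) == 0;
    }

    public static void main(String[] args) {
        int[] strengths = {
            Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY, Collator.IDENTICAL
        };
        String[] names = {"PRIMARY", "SECONDARY", "TERTIARY", "IDENTICAL"};

        // Escaped code points keep the source file independent of its encoding.
        String[][] pairs = {
            {"a", "b"},
            {"a", "A"},
            {"a", "\u00E1"},        // a vs. a-acute
            {"\u00E1", "\u00E2"},   // a-acute vs. a-circumflex
            {"\u00E1", "\u00C1"},   // a-acute vs. A-acute
            {"\u00C1", "A\u0301"},  // precomposed A-acute vs. A + combining acute
        };

        for (int i = 0; i < strengths.length; i++) {
            for (String[] pair : pairs) {
                boolean same = sameAt(Locale.US, strengths[i], pair[0], pair[1]);
                System.out.printf("%-9s %s vs %s: %s%n",
                        names[i], pair[0], pair[1], same ? "same" : "different");
            }
        }
    }
}
```

Swap Locale.US for another locale (for example new Locale("es", "CL")) to see whether the results change for the languages you care about.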

There will be slight variations between languages, but in essence:

  • If you want case-sensitive comparison, use Tertiary.
  • For case-insensitive comparison, use either Primary or Secondary depending on whether you want á to be grouped with â.
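  • Some of the results I got are quite strange. In my tests, a compared as different from á even at Primary, and á as different from Á even at Primary and Secondary, although accents are normally a secondary difference and case a tertiary one. I don't know why; possibly a bug, so test with your own data.
  • I haven't tried non-Latin scripts, so I can't say how the strengths behave there.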
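
The rules of thumb above could be wrapped up roughly like this. It is only a sketch: CollatorChooser, collatorFor and sorted are hypothetical names I made up, and the mapping assumes the usual arrangement where accents are secondary and case is tertiary differences.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollatorChooser {

    /**
     * Hypothetical helper: picks a strength from the kind of comparison wanted.
     * The locale only selects the rules; the strength is chosen by the caller's needs.
     */
    static Collator collatorFor(Locale locale, boolean caseSensitive, boolean accentSensitive) {
        Collator collator = Collator.getInstance(locale);
        if (caseSensitive) {
            // TERTIARY also distinguishes accents, so accentSensitive is ignored here.
            collator.setStrength(Collator.TERTIARY);
        } else if (accentSensitive) {
            collator.setStrength(Collator.SECONDARY);
        } else {
            collator.setStrength(Collator.PRIMARY);
        }
        return collator;
    }

    /** Returns a sorted copy; Collator implements Comparator, so it works with List.sort. */
    static List<String> sorted(List<String> words, Collator collator) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        Collator caseInsensitive = collatorFor(Locale.US, false, true);
        System.out.println(caseInsensitive.compare("hello", "HELLO") == 0); // prints true
        System.out.println(sorted(Arrays.asList("cherry", "Banana", "apple"), caseInsensitive));
        // prints [apple, Banana, cherry]
    }
}
```

Note that plain String.compareTo would put "Banana" first because uppercase letters come before lowercase in Unicode order, which is the usual reason to sort with a Collator in the first place.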
Johannes Sasongko