If you have an application localized in pt-br and pt-pt, which language should you choose when the system reports only the pt code (generic Portuguese)?

This question is independent of the nature of the application, whether desktop, mobile or browser-based. Let's assume you cannot get region information from another source and have to choose one language as the default.

The same question applies to other cases, including:

  • pt-pt and pt-br
  • en-us and en-gb
  • fr-fr and fr-CA
  • zh-cn, zh-tw, ...: in this case I know that zh can be used as the predominant code for Simplified Chinese, whose full code is zh-hans. For Traditional Chinese, with codes like zh-tw, zh-hant-tw, zh-hk and zh-mo, the proper (canonical) code should be zh-hant.

Q1: How do I determine the predominant language for a given meta-language?

I need a solution that will include at least Portuguese, English and French.

Q2: If the system reports Simplified Chinese (PRC) (zh-cn) as the user's preferred language and I have translations only for English and Traditional Chinese (en, zh-tw), which of the two should I choose: en or zh-tw?

+3  A: 

Do you expect to have more users in Portugal or in Brazil? Pick accordingly.

For a general solution, you can find out by reading up on Ethnologue.

bmargulies
Thanks, this looks like a good source of information. Based on it, I conclude that `pt-br`, `fr-fr`, `zh-cn` and `en-us` can be considered predominant in their groups.
Sorin Sbarnea
+4  A: 

In general you should separate the "guess the missing parameters" problem from the "matching a list of locales I want vs. a list of locales I have" problem. They are different.

Guessing the missing parts

These are all tricky areas, and even (potentially) politically charged.

But with very few exceptions the rule is to select the "original country" of the language. So fr-FR for fr, es-ES for es, etc. The exceptions are mostly based on population: pt-BR instead of pt-PT, en-US instead of en-GB.

It is also commonly accepted (and required by the Chinese standards) that zh maps to zh-CN.

You might also have to look at the country to determine the script, or the other way around. For instance az => az-AZ, but az-Arab => az-Arab-IR and az-IR => az-Arab-IR.
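The guessing rules above can be sketched in a few lines of Python. This is only a minimal illustration: the three tables are hand-written from the examples in this answer (they are not the CLDR likely-subtags data), and the names `parse` and `guess` are hypothetical helpers, not part of any library.

```python
# Illustrative tables only: they mirror the examples in this answer,
# not the full CLDR likely-subtags data.
LIKELY_REGION = {"fr": "FR", "es": "ES", "pt": "BR", "en": "US", "zh": "CN", "az": "AZ"}
LIKELY_REGION_FOR_SCRIPT = {("az", "Arab"): "IR"}   # (language, script) -> region
LIKELY_SCRIPT_FOR_REGION = {("az", "IR"): "Arab"}   # (language, region) -> script


def parse(tag):
    """Split a tag such as 'az_Arab_IR' into (language, script, region)."""
    parts = tag.replace("_", "-").split("-")
    lang = parts[0].lower()
    script = region = None
    for p in parts[1:]:
        if len(p) == 4 and p.isalpha():
            script = p.title()          # four letters: script subtag
        elif (len(p) == 2 and p.isalpha()) or (len(p) == 3 and p.isdigit()):
            region = p.upper()          # two letters or three digits: region
    return lang, script, region


def guess(tag):
    """Fill in the missing region and/or script when the tables know them."""
    lang, script, region = parse(tag)
    if region is None:
        if script is None:
            region = LIKELY_REGION.get(lang)
        else:
            # An explicit script may imply a different country (az-Arab -> IR).
            region = LIKELY_REGION_FOR_SCRIPT.get((lang, script))
    if script is None and region is not None:
        script = LIKELY_SCRIPT_FOR_REGION.get((lang, region))
    return "-".join(p for p in (lang, script, region) if p)


print(guess("pt"))       # pt-BR
print(guess("az"))       # az-AZ
print(guess("az-Arab"))  # az-Arab-IR
print(guess("az_IR"))    # az-Arab-IR
```

Anything the tables do not cover is returned unchanged, so a real application would want to load the complete data from CLDR or use ICU instead.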

Matching 'want' vs. 'have'

This involves matching a list of wanted languages against a list of available ones. Dealing with lists makes it harder, and the result should also be sorted in a smart way, if possible. For instance, if want = [ fr ro ] and have = [ en fr_CA fr_FR ro_RO ], then you probably want [ fr_FR fr_CA ro_RO ] as the result.

There should be no match between languages with different scripts. So zh-TW should not fall back to zh-CN, and mn-Mong should not fall back to mn-Cyrl. Tricky areas: in theory sr-Cyrl should not fall back to sr-Latn, but users might still understand it. ro-Cyrl might fall back to ro-Latn, but not the other way around.
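As a rough sketch of the want/have matching described here, the Python below keeps the order of the wanted list, never matches across scripts, and puts the default region first within a language. The tables `DEFAULT_REGION` and `IMPLIED_SCRIPT` and the ranking order are assumptions made for this example, not a standard algorithm; the asymmetric cases (such as ro-Cyrl falling back to ro-Latn) are not handled.

```python
# Hypothetical, hand-written data covering only the examples in this answer.
DEFAULT_REGION = {"fr": "FR", "pt": "BR", "en": "US", "zh": "CN", "ro": "RO"}
IMPLIED_SCRIPT = {("zh", None): "Hans", ("zh", "CN"): "Hans", ("zh", "TW"): "Hant",
                  ("zh", "HK"): "Hant", ("zh", "MO"): "Hant"}


def parse(tag):
    """Split a tag such as 'zh_Hant_TW' into (language, script, region)."""
    parts = tag.replace("_", "-").split("-")
    lang, script, region = parts[0].lower(), None, None
    for p in parts[1:]:
        if len(p) == 4 and p.isalpha():
            script = p.title()
        elif (len(p) == 2 and p.isalpha()) or (len(p) == 3 and p.isdigit()):
            region = p.upper()
    return lang, script, region


def script_of(lang, script, region):
    """Explicit script if present, else the implied one (None = unknown)."""
    return script or IMPLIED_SCRIPT.get((lang, region))


def match(want, have):
    """Return the usable entries of `have`, best first, following `want` order."""
    result = []
    for w in want:
        wl, ws, wr = parse(w)
        w_script = script_of(wl, ws, wr)
        candidates = []
        for position, h in enumerate(have):
            hl, hs, hr = parse(h)
            if hl != wl:
                continue
            h_script = script_of(hl, hs, hr)
            if w_script and h_script and w_script != h_script:
                continue  # never fall back across scripts
            if hr == wr:
                rank = 0  # same region (or both generic)
            elif hr is None:
                rank = 1  # generic translation of the language
            elif hr == DEFAULT_REGION.get(hl):
                rank = 2  # the "predominant" region
            else:
                rank = 3  # any other region of the same language
            candidates.append((rank, position, h))
        for _, _, h in sorted(candidates):
            if h not in result:
                result.append(h)
    return result


print(match(["fr", "ro"], ["en", "fr_CA", "fr_FR", "ro_RO"]))
# ['fr_FR', 'fr_CA', 'ro_RO']
print(match(["zh-CN"], ["en", "zh-TW"]))
# []
```

An empty result means nothing acceptable was found, and the caller falls back to the application's default language.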

Some references

  • RFC 4647 deals with language fallback (but is not very useful in this case, because it follows the "cut from the right" rule).
  • ICU 4.2 and newer (draft in 4.0, I think) has uloc_addLikelySubtags (and uloc_minimizeSubtags) in uloc.h. That implements http://www.unicode.org/reports/tr35/#Likely_Subtags
  • Also in ICU uloc.h there are uloc_acceptLanguageFromHTTP and uloc_acceptLanguage that deal with want vs. have. But they are of limited use as they stand, because they take a UEnumeration* as input, and there is no public API to build a UEnumeration.
  • There is some work on language matching going beyond the simple RFC 4647. See http://cldr.unicode.org/development/design-proposals/languagedistance
  • Locale matching in ActionScript at http://code.google.com/p/as3localelib/
  • The APIs in the new Flash Player 10.1 flash.globalization namespace do both tag guessing and language matching (http://help.adobe.com/en_US/FlashPlatform/beta/reference/actionscript/3/flash/globalization/package-detail.html). It works on TR-35 and can look beyond the @ and consider the operation. For instance, if have = [ ja ja@collation=radical ja@calendar=japanese ] and want = [ ja@calendar=japanese;collation=radical ] then the best match depends on the operation you want. For date formatting ja@calendar=japanese is the better match, but for collation you want ja@collation=radical
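
To make the "cut from the right" remark about RFC 4647 concrete, here is a minimal Python sketch of its lookup scheme. The function name `lookup` is made up for this example, and wildcard ranges ("*") are simply never matched. The point it illustrates is that only the wanted tag gets truncated, so a generic request like fr never finds a more specific translation such as fr-CA.

```python
def lookup(ranges, tags, default=None):
    """RFC 4647-style lookup: truncate each wanted range from the right."""
    available = {t.lower(): t for t in tags}  # comparison is case-insensitive
    for r in ranges:
        parts = r.lower().split("-")
        while parts:
            candidate = "-".join(parts)
            if candidate in available:
                return available[candidate]
            parts.pop()
            # A single-character subtag left at the end is dropped as well.
            if parts and len(parts[-1]) == 1:
                parts.pop()
    return default


print(lookup(["fr-CA"], ["en", "fr"]))              # fr
print(lookup(["zh-Hant-TW"], ["en", "zh-Hant"]))    # zh-Hant
print(lookup(["fr"], ["en", "fr-CA", "fr-FR"]))     # None
```

The last call is the pt vs. pt-BR/pt-PT situation from the question: truncation alone cannot pick a region, which is why the likely-subtags data is needed.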
Mihai Nita
"en-US instead of en-US." I think the last one should be en-GB.
Thomas
Fixed (for posterity :-). Thanks Thomas.
Mihai Nita