I've read Joel's article on Unicode and I feel that I have at least a basic grasp of internationalization from a character set perspective. In addition to reading this question, I've also done some of my own research on internationalization in regards to design considerations, but I can't help but suspect that there is a lot more out there that I just don't know or don't know to ask.

Some of the things I've learned:

  • Some languages read right-to-left instead of left-to-right.
  • Calendar, dates, times, currency, and numbers are displayed differently from language to language.
  • Design should be flexible enough to accommodate a lot more text because some languages are far more verbose than others.
  • Don't take icons or colors for granted when it comes to their semantic meaning as this can vary from culture to culture.
  • Geographical nomenclature varies from language to language.

Where I'm at:

  • My design is flexible enough to accommodate a lot more text.
  • I automatically translate each string, including error messages and help dialogs.
  • I haven't come to a point yet where I've needed to display units of time, currency or numbers, but I'll be there shortly and will need to develop a solution.
  • I'm using the UTF-8 character set across the board.
  • My menus and various lists in the application are sorted alphabetically for each language for easier reading.
  • I have a tag parser that extracts tags by filtering out stop words. The stop words list is language specific and can be swapped out.

What I'd like to know more about:

  • I'm developing a downloadable PHP web application, so any specific advice in regards to PHP would be greatly appreciated. I've developed my own framework and am not interested in using other frameworks at this time.
  • I know very little about non-western languages. Are there specific considerations that need to be taken into account that I haven't mentioned above? Also, how do PHP's array sorting functions handle non-western characters?
  • Are there any specific gotchas that you've experienced in practice? I'm looking in terms of both the GUI and the application code itself.
  • Any specific advice for working with date and time displays? Is there a breakdown according to region or language?
  • I've seen a lot of projects and sites let their communities provide translation for their applications and content. Do you recommend this and what are some good strategies for ensuring that you have a good translation?
  • This question is basically the extent of what I know about internationalization. What don't I know that I don't know that I should look into further?

Edit: I added the bounty because I would like to have more real-world examples from experience.

+5  A: 
  • My menus and various lists in the application are sorted alphabetically for each language for easier reading.

lists should be sorted, menus shouldn't. keep in mind that a given user might want to use your application in more than one language; he should still find everything in the same place.

the same with shortcuts, if you have any: do not translate them.

also, remember that internationalization and translation are two very different things, manage them separately.

Javier
That's a good point about the menus. By shortcuts, are you referring to keyboard shortcuts?
VirtuosiMedia
+1 for "don't translate the keyboard shortcuts" alone
peterchen
+35  A: 

Our game Gemsweeper has been translated to 8 different languages. Some things I have learned during that process:

  • If the translator is given single sentences to translate, make sure that he knows about the context that each sentence is used in. Otherwise he might provide one possible translation, but not the one you meant. Tools such as Babelfish translate without understanding the context, which is why the result is usually so bad. Just try translating any non-trivial text from English to German and back and you'll see what I mean.

  • Sentences that should be translated must not be broken into different parts for the same reason. That's because you need to maintain the context (see previous point) and because some languages might have the variables at the beginning or end of the sentence. Use placeholders instead of breaking up the sentence. For example, instead of

"This is step" "of our 15-step tutorial"

Write something like:

"This is step %1 of our 15-step tutorial"

and replace the placeholder programmatically.

  • Don't expect the translator to be funny or creative. He usually isn't motivated enough to do it unless you name the particular text passages and pay him extra. For example, if you have any word jokes in your language assets, tell the translator in a side note not to try to translate them, but to leave them out or replace them with a plainer sentence instead. Otherwise the translator will probably translate the joke word by word, which usually results in complete nonsense. In our case we had one translator and one joke writer for the most critical translation (English).

  • Try to find a translator whose first language is the language he is going to translate your software to, not the other way round. Otherwise he is likely to write a text that might be correct, but sounds odd or old-fashioned to native speakers. Also, he should be living in the country you are targeting with your translation. For example, a German-speaking guy from Switzerland would not be a good choice for a German translation.

  • If at all possible, have one of your public beta test users who understands the particular translation verify the translated assets and the completed software. We've had some very good and very bad translations, depending on the person who provided them. According to some of our users, the Swedish translation was total gibberish, but it was too late to do anything about it.

  • Be aware that, for every updated version with new features, you will have to have your language assets translated again. This can create some serious overhead.

  • Be aware that end users will expect tech support to speak their language if your software is translated. Once again, Babelfish will most probably not do.
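
The placeholder advice above can be sketched roughly as follows. This is a minimal illustration in Python rather than PHP; the `MESSAGES` table, the `tutorial_progress` helper, and the non-English strings are all hypothetical and unreviewed, not taken from Gemsweeper. The point is only that each language keeps the whole sentence and decides for itself where the placeholders go.

```python
# Hypothetical catalog: every language stores the complete sentence and
# positions the named placeholders wherever its grammar needs them.
# The German and Japanese strings are illustrative, not vetted translations.
MESSAGES = {
    "en": "This is step {step} of our {total}-step tutorial",
    "de": "Dies ist Schritt {step} unseres Tutorials mit {total} Schritten",
    "ja": "全{total}ステップ中のステップ{step}です",
}


def tutorial_progress(lang: str, step: int, total: int) -> str:
    """Fill the placeholders of the sentence for `lang` (English fallback)."""
    template = MESSAGES.get(lang, MESSAGES["en"])
    return template.format(step=step, total=total)


print(tutorial_progress("en", 3, 15))
print(tutorial_progress("ja", 3, 15))
```

Named placeholders (rather than positional ones) let the Japanese string put the total before the step number without any change to the calling code.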

Edit - Some more points

  • Make switching between localizations as easy as possible. In Gemsweeper, we have a hotkey to switch between different languages. It makes testing much easier.

  • If you are going to use exotic fonts, make sure these include special characters. The fonts we chose for Gemsweeper were fine for English text, but we had to add quite a few characters by hand which only exist in German, French, Portuguese, Swedish, ...

  • Don't code your own localization framework. You're probably much better off with an open source framework like Gettext. Gettext supports features like variables within sentences and pluralization, and is rock-solid. Localized resources are compiled, so nobody can tamper with them. Plus, you can use tools like Poedit for translating your files, checking someone else's translation, and making sure that all strings are properly translated and still up to date in case you change the underlying source code. I've tried both rolling my own and using Gettext, and I have to say that Gettext plus Poedit was way superior.

Edits - Even More Points

  • Understand that different cultures have different styles of number and date formats. Numbering schemes differ not only per culture, but also per purpose within that culture. In en-US you might format a number as '-1234', '-1,234' or '(1,234)' depending on what the purpose of the number is. Understand that other cultures do the same thing.

  • Know where you're getting your globalization information from. E.g. Windows has settings for CurrentCulture, UICulture, and InvariantCulture. Understand what each one means and how it interacts with your system (they're not as obvious as you might think).

  • If you're going to do East Asian translation, really do your homework. East Asian languages differ from Western languages in quite a few ways. In addition to having multiple writing systems that are used simultaneously, they can use different layout systems, such as top-down or grid-based layouts. Numbers in East Asian languages can also be very different. In en-US you only change number systems under limited conditions (e.g. 1 versus 1st), but elsewhere there are additional numeric considerations besides just comma and period.
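
To make the point about number formats concrete, here is a small hand-rolled sketch in Python. The `NumberStyle` class, the `STYLES` table, and the style names (including "en-US-accounting") are assumptions made up for this example; a real application would normally take this data from the platform's locale facilities rather than hard-coding it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumberStyle:
    group_sep: str    # thousands separator
    decimal_sep: str  # decimal separator
    negative: str     # pattern for negatives; "{n}" is the formatted magnitude


# Hypothetical style table: one culture can need several styles by purpose.
STYLES = {
    "en-US": NumberStyle(",", ".", "-{n}"),
    "en-US-accounting": NumberStyle(",", ".", "({n})"),
    "de-DE": NumberStyle(".", ",", "-{n}"),
}


def format_number(value: float, style_name: str, decimals: int = 0) -> str:
    style = STYLES[style_name]
    # Format with the fixed "," and "." first, then swap in the style's
    # separators, using "\0" as a temporary stand-in for the group separator.
    text = f"{abs(value):,.{decimals}f}"
    text = (
        text.replace(",", "\0")
        .replace(".", style.decimal_sep)
        .replace("\0", style.group_sep)
    )
    return style.negative.format(n=text) if value < 0 else text


print(format_number(-1234, "en-US"))             # -1,234
print(format_number(-1234, "en-US-accounting"))  # (1,234)
print(format_number(1234.5, "de-DE", 2))         # 1.234,50
```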

Adrian Grigore
Great information, thank you.
VirtuosiMedia
The tech support issue is one that I've thought of as well. We don't have the resources at this point to provide support in as many languages as we could probably translate. How do you suggest handling that?
VirtuosiMedia
My idea was to state that official support is in English only, but then provide a forum for each language so that the community could answer questions as well.
VirtuosiMedia
I guess it depends. We can still get away with it because we create games that require almost no tech support whatsoever. Plus, sales questions can be covered by our registration services (Avangate, ShareIt). Forums might work, but only if you can reach a critical mass of users.
Adrian Grigore
+1 for some pertinent points. I would add that if your application is domain specific, make sure that your translator has knowledge of that domain as well. The point about the translation direction is good as well. I'm English, I work in French, but I'm not qualified to translate to French.
MatthieuF
That's also a good point, Matthieu.
VirtuosiMedia
+4  A: 

A thing about numbers: in English, as I understand, you just use a singular with 1 and plural with 2 or more. Like: “You have 1 message”; “2 messages”; “3... messages”. In Russian, these things get more complicated. You use singular for 1, 21, 31, 41... 101, 121 (so, for everything ending with 1 except when it ends with 11). Then you use singular genitive case for 2, 3, 4; 22, 23, 24; 32, 33, 34... 102, 103, 104; 122, 123, 124. And in all other cases you use plural genitive case.

It’s not really hard to implement. What is hard though is to implement something that will know how to deal with any a priori unknown language with all its weirdness :-)

And that’s just numbers :-)
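
The Russian rule described above fits in a few lines. This is a sketch in Python; the function names are made up for the example, and the three word forms are the standard forms of "сообщение" (message). The hard part mentioned above remains: every language needs its own version of `russian_plural_form`.

```python
def russian_plural_form(n: int) -> int:
    """Return 0, 1 or 2: which of the three Russian word forms goes with n."""
    n = abs(n)
    if n % 10 == 1 and n % 100 != 11:
        return 0  # 1, 21, 31, ... 101, 121 (but not 11, 111)
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1  # 2-4, 22-24, ... 102-104 (but not 12-14)
    return 2      # everything else: 0, 5-20, 25-30, ...


# Nominative singular, genitive singular, genitive plural of "message".
FORMS = ("сообщение", "сообщения", "сообщений")


def messages_label(n: int) -> str:
    return f"{n} {FORMS[russian_plural_form(n)]}"


for count in (1, 2, 5, 11, 21, 22, 112):
    print(messages_label(count))
```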

Ilya Birman
That's quite an interesting gotcha and one that I would never have thought to ask about. Thanks.
VirtuosiMedia
Incidentally, I have no idea as to how I would solve that.
VirtuosiMedia
Me neither. Even if you provide a callback for that, so that a “language pack” can implement any logic for you — who knows how many callbacks you will need, and in what places.
Ilya Birman
similarly for first second third or 1st 2nd 3rd - the abbreviations differ by language also.
Greg Domjan
In Russian these are all -й :-) (1-й, 2-й, 3-й for первый, второй, третий)
Ilya Birman
+4  A: 

One thing I've learned the hard way: if you have several files that need to be translated, include an extra tag in the name, so that later you can search your whole folder for that tag.

e.g. instead of naming a file 'sample-database.txt', name the English version 'sample-database-loc-en.txt' and the Italian version 'sample-database-loc-it.txt'.
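
As a rough illustration of why the tag helps, the sketch below (Python; the `LOC_TAG` pattern and `group_by_language` helper are hypothetical, and it assumes two-letter language codes placed right before the extension) finds every localized file in a list of names and groups them by language.

```python
import re

# Matches "-loc-xx" immediately before the file extension, e.g.
# "sample-database-loc-en.txt" -> lang "en".
LOC_TAG = re.compile(r"-loc-(?P<lang>[a-z]{2})(?=\.[^.]+$)")


def group_by_language(filenames):
    """Map each language code to the localized files that carry its tag."""
    groups = {}
    for name in filenames:
        match = LOC_TAG.search(name)
        if match:
            groups.setdefault(match.group("lang"), []).append(name)
    return groups


files = ["sample-database-loc-en.txt", "sample-database-loc-it.txt", "readme.txt"]
print(group_by_language(files))
```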

We're doing that and it certainly does make it a lot easier.
VirtuosiMedia
+1  A: 

Yes, this is a massive subject. Getting it right is an awful lot of work.

In my program I use an integer key for every piece of text and look it up in a file as needed depending on the language. There are no literal strings anywhere in the code, only keys. I define them with an "enum" in C++ so I'm not actually typing numbers. I wrote a utility to synchronize the various language files when I add more enums, and the translators fill in the blanks.

Each key also has an associated tooltip, image, keyboard shortcut, etc.
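
A stripped-down version of that idea might look like this. It is sketched in Python rather than C++, with in-memory dictionaries standing in for the language files; the `Text` enum, `CATALOGS`, `missing_keys` and `lookup` are names invented for the example, not the program described above.

```python
from enum import Enum, auto


class Text(Enum):
    """One key per piece of user-visible text; code never uses literals."""
    OK = auto()
    CANCEL = auto()
    FILE_NOT_FOUND = auto()


# Stand-ins for the per-language files; "de" is deliberately incomplete.
CATALOGS = {
    "en": {Text.OK: "OK", Text.CANCEL: "Cancel", Text.FILE_NOT_FOUND: "File not found"},
    "de": {Text.OK: "OK", Text.CANCEL: "Abbrechen"},
}


def missing_keys(lang: str):
    """The 'blanks' a translator still has to fill in for this language."""
    return [key for key in Text if key not in CATALOGS[lang]]


def lookup(key: Text, lang: str) -> str:
    """Return the text for `key`, falling back to English if untranslated."""
    return CATALOGS[lang].get(key, CATALOGS["en"][key])


print(missing_keys("de"))
print(lookup(Text.CANCEL, "de"))
```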

As for times and dates ... again, this is much more complex than you might think but doesn't PHP handle this for you? (I don't know, I'm a C++ guy...)

Jimmy J
+4  A: 

I would like to make the following comments (these are from our company guidelines where our class-1 products are translated in 31 different locales). Following these rules has given us (our development team rather than the whole company) the greatest productivity in translation.

  • Don't reuse snippets. For example, don't think that because you have the two errors "You selected the wrong menu item" and "That menu item is not yet available", you can extract "menu item" into a separate item and use it in both places. All messages should be self contained as their translations may change based on context.
  • Use a professional translator knowledgeable about tech. If you go near BabelFish, you're going to get everything you deserve. For example, "Microsoft Windows" is "Microsoft Windows" everywhere on the planet, it doesn't become "Microsoft Fenster" in Germany.
  • Try not to embed variables within your messages (messages such as "The %1 has failed") since positions and, indeed, gender may change: "La Table est rubbish" vs. "L'Homme est drunk". Better to use a generic noun with appended parameters: "The item has failed [%1]".
  • Only translate things which the user is expected to see. Log messages in a log file that only you will use, should be in English (or your native language), not translated to Swahili that you couldn't read anyway.
  • Menus should be sorted by functionality, not collating order.
  • Translatable units should be stored outside the code and loaded in at runtime. This makes translation an issue of just shipping off the language file, not trying to shoehorn changes into the middle of code. It also makes adding other languages easier in future.
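
Two of the rules above (self-contained messages with an appended parameter, and translatable units loaded from outside the code) can be sketched together. This is Python, with JSON strings standing in for hypothetical per-language files; the key names, the `{detail}` placeholder, and the French text are assumptions for illustration only.

```python
import json

# Stand-ins for the contents of hypothetical files such as
# "messages.en.json" and "messages.fr.json" shipped next to the application.
EN_FILE = """{"item_failed": "The item has failed [{detail}]", "saved": "Saved"}"""
FR_FILE = """{"item_failed": "L'élément a échoué [{detail}]"}"""


def load_catalog(json_text: str) -> dict:
    """Parse one language file into a key -> message dictionary."""
    return json.loads(json_text)


def message(catalog: dict, fallback: dict, key: str, **params) -> str:
    """Look up a whole message; use the fallback language if it is missing."""
    template = catalog.get(key, fallback[key])
    return template.format(**params)


en = load_catalog(EN_FILE)
fr = load_catalog(FR_FILE)
print(message(fr, en, "item_failed", detail="printer"))
print(message(fr, en, "saved"))  # not translated yet, so English is shown
```

Because the variable is appended in brackets instead of embedded in the grammar of the sentence, the translator never has to make the sentence agree with whatever value ends up in `detail`.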

That's enough for now. Better to stop before you all fall asleep :-)

paxdiablo
+7  A: 

When we worked on the i18n/l10n issues of Dreamfall and Age of Conan, we came across a few issues that are worth keeping in mind. Some of these we solved, some were solved for us, and some we worked around. Some we never solved...

  • Make sure all your tools and all your code support all the charsets you want to use, and double check that assumption twice during the course of the project and a couple more times to be sure.

  • Make sure you use a font that supports all the languages you want to use. Most fonts that claim to be Unicode are only Unicode in the sense that the characters they have are at the correct codepoints. It does not mean they have usable characters for all codepoints.

  • Text-wrapping is not only done at spaces, as some languages don't use spaces to separate words (Chinese comes to mind). Make sure your text-wrapping routines handle text without any spaces at all.

  • Handling plurals correctly is tricky in the easy cases, and damned hard in the hard cases. Make sure you know enough about the languages you'll be using to be able to write code to handle the plural issue correctly. Keep in mind that English (and the other "western" languages) are among the easy ones.

  • Never break sentences and build strings with them to fit a variable, as the variable might be placed elsewhere in the sentence in a different language. Use placeholders.

  • Keep in mind that for some languages, the value of the placeholder might change how to write the sentence. Grammar is hard. Make sure you have a plan for dealing with it. (Specifically, make sure you have a way to classify the values you use in the placeholders according to gender, time, etc).
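
On the text-wrapping point, a naive wrapper that only breaks at spaces will emit one enormous line for Chinese text. The sketch below (Python; `wrap_text` is a made-up helper) breaks at spaces when it can and falls back to a hard break otherwise. It counts code points rather than display width and ignores the real line-breaking rules of each language, so treat it as a smoke test of the idea, not a layout engine.

```python
def wrap_text(text: str, width: int) -> list[str]:
    """Wrap at spaces where possible; hard-break runs longer than `width`."""
    lines, current = [], ""
    for word in text.split(" "):
        # A "word" longer than the line (e.g. a sentence without spaces)
        # is cut into width-sized pieces instead of overflowing.
        while len(word) > width:
            if current:
                lines.append(current)
                current = ""
            lines.append(word[:width])
            word = word[width:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= width:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


print(wrap_text("the quick brown fox", 10))
print(wrap_text("這是一個沒有空格的句子", 4))
```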

Epcylon
Interesting point about the fonts, one that I've never heard brought up before. Is there a resource that you'd recommend for choosing a font that supports international characters? Great point about placeholders as well.
VirtuosiMedia
Alan Wood's Unicode resources (http://www.alanwood.net/unicode/index.html) lists fonts that are unicode compatible, and which characters they support. It might be that you have to pick a couple fonts, and combine them, depending on your needs (the most complete ones don't always look so good).
Epcylon
+3  A: 

My first answer on StackOverflow, so pardon me if I say something stupid.

From my experience:

  • PHP: gettext has been extremely helpful;
  • non-western languages: UTF-8 everywhere (code, DB) and so far we're doing well;
  • Are there any specific gotchas that you've experienced in practice? Breaking long paragraphs for i18n into separate sentences can make translation less expensive: if a string is repeated more than once in the site, you only need to have it translated once. But be careful: if you fragment the text too much, translators will lose context;
  • I've seen a lot of projects and sites let their communities provide translation for their applications and content. Do you recommend this and what are some good strategies for ensuring that you have a good translation? If you have a very large number of volunteers go for it, but depending on how much text you have, you might really need a ton of volunteers. Always make sure also that you have someone you trust being the leader of a language project to be the proof-reader controlling the accuracy of the translation.
Danilo OpenID
+1  A: 

PHP represents strings internally as byte-streams, and assumes iso-8859-1, for the cases where the encoding matters. For the most part, you can just use UTF-8 all over the place, and you'll be fine. One gotcha, if your site takes input from its users, is that you can never be 100% sure that they are submitting content in the proper encoding. You might want to use mb_detect_encoding to verify input, or use a hidden field with "exotic" characters to verify against.

Be aware that all string-related functions in PHP, that work on a character-basis, assume that character = byte. That means that you generally can't trust string functions. Have a look at this page for more details.
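
The byte-versus-character distinction is easy to see in a language that keeps the two apart. This sketch is in Python, not PHP, and the two helper names are invented for the example; it shows that a UTF-8 string has more bytes than characters, and one strict way of rejecting input that is not valid UTF-8.

```python
def byte_and_char_length(text: str):
    """Return (number of UTF-8 bytes, number of characters) for `text`."""
    return len(text.encode("utf-8")), len(text)


def decode_utf8_or_none(raw: bytes):
    """Decode user input strictly; return None if it is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(byte_and_char_length("naïve"))             # (6, 5): "ï" takes two bytes
print(decode_utf8_or_none("é".encode("utf-8")))  # é
print(decode_utf8_or_none("é".encode("latin-1")))  # None: a lone 0xE9 byte
```

A byte-oriented length or substring function would report 6 for "naïve" and could cut the "ï" in half, which is the kind of surprise described above.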

Another good resource for PHP, is Nick Nettleton's cheatsheet.

A subject that is very closely related to charsets/encodings is collation. You need your collations to match the language/culture that you are working with. At least in MySQL (probably in other RDBMSes as well), you can specify the collation on different levels, such as per-database, per-table, per-column and even in the query itself.
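
To show what "matching the collation to the culture" means in practice, here is a deliberately simplified sketch in Python. The two key functions are hand-rolled assumptions that cover only a handful of letters: German dictionary order treats ä like a, while Swedish places å, ä, ö after z. Real applications should rely on the database collation or a proper collation library rather than tables like these.

```python
# Simplified German dictionary order: umlauts sort like their base letters.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

# Simplified Swedish alphabet: å, ä and ö are separate letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def german_key(word: str):
    # The original word is a tie-breaker so results are deterministic.
    return (word.lower().translate(GERMAN_FOLD), word)


def swedish_key(word: str):
    unknown = len(SWEDISH_ALPHABET)  # anything unlisted sorts last
    return [
        SWEDISH_ALPHABET.index(ch) if ch in SWEDISH_ALPHABET else unknown
        for ch in word.lower()
    ]


words = ["zebra", "äpple", "apple"]
print(sorted(words, key=german_key))   # ['apple', 'äpple', 'zebra']
print(sorted(words, key=swedish_key))  # ['apple', 'zebra', 'äpple']
```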

troelskn
Those are some great resources, thank you.
VirtuosiMedia
+3  A: 

I don't have a whole lot to add to the great answers so far, but here are a few things to consider and to check.

  • Don't make assumptions. This is the catch all rule. It is easy to assume things that are region or language specific and it is hard to notice these assumptions.
  • Be very careful with string comparisons. There are some languages, such as Turkish, which have letters that are similar to others visually but which are different.
  • Use pseudo translation as a smoke test. If you read your translated strings from a resource file, create a pseudo translated version of the file that is still understandable to you but which stresses the capacity and capability of every translatable string in the application. For example, pad out a string like "Cancel" with something like "CancelXXXX!" so that it is as wide as your allowance for translated strings. Then you can test to verify that every string will display fully. Extra credit for also sticking in the most complex character likely to be rendered to verify that it displays correctly in all places.
  • Don't make assumptions about keyboard layouts. "ASDW" may be a great control set of directional keys for QWERTY keyboards, but hard coding that makes it unfriendly, if not impossible, to use for people with other keyboard layouts.
  • Test various date settings, then test them again. I have seen issues due to something as small as a different format for "AM/PM" in regional settings. The mm/dd/yyyy vs. dd/mm/yyyy also comes up a lot, but every setting here can matter.
  • Test various number formats, then test them again. You do not want to depend on decimal or thousands separators, for example.
  • Test with and without a user logged in to the server. This may be more Windows specific, but it is very easy to get a component on the server configured such that it uses the logged in user's regional settings while a user is logged in and a default regional setting when the user is not logged in. This can cause strange, intermittent behavior.
  • Test with various regional and language settings. As an example, not only does Windows have regional and language settings, but IE has its own language setting. The behavior of an IE client with en-us listed first may not always be the same as one with en-nz listed first, for example.
  • Make sure your translator understands the business and the languages, then cross check with someone else. Be very careful any time you use application specific terminology. If your program uses specific words to mean something special in the application, make sure they are translated in a similar way in every instance, including in the help text. If you have specific language targets, you might even go so far as to translate such words ahead of time and make sure they don't translate poorly in the target languages. This is more of a product research thing, but it can make a difference in what words are used in the interface, and it is easier on everyone if those words are in place from the beginning. You also want to avoid idioms that may not translate well.
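
The pseudo-translation idea from the list above takes very little code. This Python sketch uses a made-up `pseudo_translate` helper with an assumed 40% expansion factor: it accents the vowels (so untranslated, hard-coded strings stand out), pads the text, and brackets it (so truncation is visible). It would also mangle placeholders such as `{name}` inside a string, so a real version needs to skip those.

```python
# Replace plain vowels with accented ones; text stays readable to a developer.
ACCENTED = str.maketrans("aeiouAEIOU", "áéíóúÁÉÍÓÚ")


def pseudo_translate(text: str, expansion: float = 0.4) -> str:
    """Return an accented, padded, bracketed stand-in for a translation."""
    padding = "X" * max(1, round(len(text) * expansion))
    return "[" + text.translate(ACCENTED) + padding + "!]"


print(pseudo_translate("Cancel"))  # [CáncélXX!]
print(pseudo_translate("Save changes before closing?"))
```

If the closing "!]" is missing anywhere in the running application, the string was cut off; if a string shows up without accents, it bypassed the translation lookup.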

Okay, I had more to say than I thought...

Malachi
+2  A: 
  • Collation/sorting rules can differ wildly between languages: ä is sorted differently in German than it is in Swedish. So sorting needs to be culture-specific.
  • Upper/lowercasing can hold surprises: The German "sharp S" character ß does not have an uppercase version, and is either transformed to "SS", or stays lowercase if exactness is important. Turkish has a dotless lowercase i and an uppercase dotted I.
  • For multilingual web apps, think carefully about how to decide what version to show and how to work it into the URL. The user should always be able to manually choose the language, and you want search engines to find different language versions under different URLs.
  • Some East Asian languages (namely Japanese and Chinese, maybe others) don't have spaces between words.
  • Japanese (maybe others too) has separate versions ("full width") of Arabic digits and space, and even two versions of some of its own characters (half-width and full-width katakana).
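
A few of the casing and width points above can be demonstrated with Python's standard library. The `turkish_lower` and `turkish_upper` helpers are hypothetical and handle only the dotted/dotless i pairs; `normalize_width` uses Unicode NFKC normalization, which is one possible way to fold full-width and half-width forms together (it also changes other compatibility characters, so it is not always appropriate).

```python
import unicodedata


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: I -> ı (dotless), İ -> i."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_upper(text: str) -> str:
    """Uppercase with Turkish rules: i -> İ (dotted), ı -> I."""
    return text.replace("i", "İ").replace("ı", "I").upper()


def normalize_width(text: str) -> str:
    """Fold full-width digits/space and half-width katakana to standard forms."""
    return unicodedata.normalize("NFKC", text)


print("straße".upper())           # STRASSE: ß becomes two letters
print(turkish_upper("istanbul"))  # İSTANBUL
print(turkish_lower("ISPARTA"))   # ısparta
print(normalize_width("１２３"))   # 123
print(normalize_width("ｶﾀｶﾅ"))    # カタカナ
```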
Michael Borgwardt