We're getting ready to translate our PHP website into various languages, and the gettext support in PHP looks like the way to go.

All the tutorials I see recommend using the English text as the message ID, i.e.

gettext("Hi there!")

But is that really a good idea? Let's say someone in marketing wants to change the text to "Hi there, y'all!". Then don't you have to update all the language files because that string -- which is actually the message ID -- has changed?

Is it better to have some kind of generic ID, like "hello.message", and an English translations file?

A: 

Having produced many i18n applications, most of them with language switching, I can confirm that using a distinct key that isn't the English equivalent is the only way to do it.

Apart from the example you give, there are times when the same English phrase may need to be translated differently.

So, each unique phrase should have a unique ID. My preferred format is ~PHnnnn~, as I can easily parse it out and do a lot of the work automatically.

For example:

 /*
  * in the following line ~PH2228~ is a placeholder and will be replaced with the 
  * heading text from the language database - the resulting output will be 
  * whatever language is currently selected by the user.
  */ 
 phrase_op("<h1>~PH2228~</h1>"); // ~PH2228~ = "Introduction to the system"

phrase_op is a simple routine that takes phrases from a database and replaces the tokens. I admit that the code isn't as readable; however, the advantage is that the English is also editable, so you get a mini CMS for free.

For reference - the core of my entire language system is two functions:

/**
 * outputs a string changing all the ~PH####~ into the corresponding phrase
 */
function phrase_op($txt)
{
    preg_match_all("/\~PH[0-9][0-9][0-9][0-9]\~/",$txt,$regs,PREG_SET_ORDER);
    foreach($regs as $result) {
        $phc = substr($result[0],1,strlen($result[0])-2);
        $ph = get_phrase($phc);
        $txt = str_replace($result[0],$ph,$txt);
    }
    echo $txt;
}

/*
 * Loads the phrases from the database; called once during system load. For a system with a lot
 * of phrases (more than 3000) it may be more efficient to modify phrase_op to load each phrase as required from the DB.
 * Alternatively the table could be modified to work with modules; however, initial testing during
 * development showed that this load-everything method wasn't a performance issue.
 * Tested against a production server with 572 phrases.
 */
function load_phrases()
{
    global $phrases;
    global $config;

    $phrases = array();

    $sql = "SELECT * FROM tb_phrases where iso='".mysql_real_escape_string($config['cur_lang'])."'";

    $result = mysql_query($sql) or die('exec_query failed: [ '.$sql.' ]'.mysql_error());

    while ($row = mysql_fetch_array($result, MYSQL_ASSOC))
    {
        $phrases[$row['phrase_code']] = $row['text'];
    }
}
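
phrase_op() calls get_phrase(), which is not among the two functions shown. A minimal sketch of what such a helper might look like, assuming the $phrases array filled by load_phrases() is keyed by codes such as PH2228 (the token without its tildes). Returning the bare code for a missing phrase is an assumption made for this sketch, not something the answer states:

```php
<?php
// Hypothetical helper: get_phrase() is called by phrase_op() but is not shown
// in the answer. Assumes $phrases was filled by load_phrases() and is keyed
// by codes such as "PH2228" (the token without its tildes).
function get_phrase($code)
{
    global $phrases;

    // Assumption: return the bare code when no phrase exists, so a missing
    // translation stays visible in the output instead of vanishing.
    return isset($phrases[$code]) ? $phrases[$code] : $code;
}

// Stand-in for load_phrases(), so the sketch runs without a database.
$phrases = array('PH2228' => 'Introduction to the system');

echo get_phrase('PH2228'), "\n"; // a known phrase
echo get_phrase('PH9999'), "\n"; // an unknown code is echoed back unchanged
```

Because the lookup is a plain array access, each replacement in phrase_op() costs no database query once load_phrases() has run.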
Richard Harrison
Since it's not evident from reading the code what "~PH2228~" is supposed to tell the user, how do you go about keeping the code easy to follow? By just adding a comment with a rough English equivalent of the message?
Ates Goral
I've added some comments. ~PH2228~ is a placeholder that will be expanded with the correct phrase from the database.
Richard Harrison
Still, the code is not readable. -1
tharkun
As everyone knows, comments and code go easily out of sync. IMHO it's just general good wisdom to keep magic numbers and codes out of your code.
Rene Saarsoo
To me the code is readable, but MRDA applies. I would have thought that, to anyone proficient in SQL, the code was self-explanatory. Am I really that wrong?
Richard Harrison
Am I following this correctly: if there are 100 text translations on one page, your method would need 100 queries to the DB, and you are also not using gettext? I am trying to do language translation myself and want to learn the best way of doing it.
jasondavis
My method loads all of the translations at the start via a call to load_phrases(). Each translation replacement then uses the phrases array. If performance becomes an issue, optimisations could be applied in the phrase_op routine.
Richard Harrison
+1  A: 

Haven't you already answered your own question? :)

Clearly, if you intend to support i18n of your application, you should treat all the language implementations the same. If someone decides a string needs to change, you make a similar change in all the language files. The metadata with the checkin should group all the language files together in the same change. If your "default" language is handled differently, that makes it more difficult to maintain.

David M. Karr
That's assuming you have a German, Japanese, Chinese, Arabic, etc. speaker ready to translate at all times during your development cycle. That has never been my experience. On projects I have worked on, we change the original text (English), then aggregate the changes at the end of the cycle.
dcstraw
+4  A: 

The reason for the IDs being English is so that the ID is returned if the translation fails for whatever reason - the translation for the current language and token not being available, or other errors. That of course assumes the developer is writing the original English text, not some documentation person.

Also, if the English text changes, don't the other translations probably need to be updated too?

In practice we also use pure IDs rather than the English text, but it does mean we have to do lots of extra work to default to English.
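
The extra work of defaulting to English with pure IDs could look roughly like this. It is a minimal sketch using plain PHP arrays rather than gettext; the catalog layout, the message IDs, and the function name translate_with_fallback are all made up for illustration:

```php
<?php
// Hypothetical catalogs keyed by language code, then by message ID.
// In a real system these would come from translation files or a database.
$catalogs = array(
    'en' => array(
        'greeting.hello' => 'Hi there!',
        'greeting.bye'   => 'Goodbye!',
    ),
    'de' => array(
        'greeting.hello' => 'Hallo!',
        // 'greeting.bye' has not been translated yet.
    ),
);

// Look the ID up in the requested language, then in English, and finally
// return the ID itself so that something is always shown.
function translate_with_fallback(array $catalogs, $lang, $id)
{
    foreach (array($lang, 'en') as $candidate) {
        if (isset($catalogs[$candidate][$id])) {
            return $catalogs[$candidate][$id];
        }
    }
    return $id;
}

echo translate_with_fallback($catalogs, 'de', 'greeting.hello'), "\n";   // German text
echo translate_with_fallback($catalogs, 'de', 'greeting.bye'), "\n";     // English fallback
echo translate_with_fallback($catalogs, 'de', 'greeting.unknown'), "\n"; // the ID itself
```

With English text as the ID, the last two steps collapse into one, which is the convenience described above.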

Douglas Leeder
A: 

I'd go so far as to say that you never (for most values of never) want to use free text as keys to anything. Imagine if SO used the query title as key to this page for instance. If someone links to it, and then the title is edited, the link is no longer valid.

Your problem is similar, except you would also be responsible for updating all links...

Like Douglas Leeder mentions, what you probably want to do is use English as the default (backup) language, although an interface that uses English and another language intermixed is highly confusing (but mildly amusing, too).

Berserk
Links are different than messages. If the original text of a message changes, you don't necessarily want the same translated text to appear because the meaning of the message might be different. It's better to examine the message at that point to see if retranslation is necessary.
dcstraw
Unfortunately gettext makes setting a default language **really** hard.
Douglas Leeder
+7  A: 

I use meaningful IDs such as "welcome_back_1" which would be "welcome back, %1" etc. I always have English as my "base" language, so in the worst-case scenario, when a specific language doesn't have a message ID, I fall back on English.

I don't like to use actual English phrases as message IDs because if the English changes, so does the ID. This might not affect you much if you use some automated tools, but it bothers me. I don't like to use simple codes (like msg3975) because they don't mean anything, so reading the code is more difficult unless you litter comments everywhere.
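
A meaningful ID with a numbered placeholder might be resolved along these lines. This is only a sketch: the $messages array and the format_message helper are hypothetical names, and the %1-style substitution is done with plain strtr() rather than any particular i18n library:

```php
<?php
// Hypothetical English base catalog. The "_1" suffix in the ID hints that
// the message takes one argument.
$messages = array(
    'welcome_back_1' => 'Welcome back, %1',
);

// Hypothetical helper: replaces %1, %2, ... with the supplied arguments.
function format_message(array $messages, $id, array $args = array())
{
    // Assumption: an unknown ID is returned as-is.
    $text = isset($messages[$id]) ? $messages[$id] : $id;

    // Build the whole replacement map first so that strtr() substitutes all
    // placeholders in a single pass.
    $map = array();
    foreach (array_values($args) as $i => $arg) {
        $map['%' . ($i + 1)] = (string) $arg;
    }

    return $map ? strtr($text, $map) : $text;
}

echo format_message($messages, 'welcome_back_1', array('Alice')), "\n";
```

The call site stays readable (welcome_back_1 says what the message is for), while the English wording can change without touching the code.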

Christopher Nadeau
+2  A: 

I strongly disagree with Richard Harrison's answer, in which he states that it is "the only way". Dear asker, do not trust an answer that claims to be the only way, because the "only way" doesn't exist.

Here is another way, which IMHO has a few advantages over Richard's approach:

  • Start by using a proto-version of the English string as the original.
  • Don't display these proto-strings, but create a translation file for English nonetheless.
  • Copy the proto-strings into the English translation to begin with.

Advantages:

  • readable code
  • text in your code is very close if not identical to what your view displays
  • if you want to change the English text, you don't change the proto-string but the translation
  • if you want to translate the same thing twice, just write a slightly different proto-string or just add 'version for this and that' and you still have perfectly readable code
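
The proto-string approach above can be sketched with plain arrays. The catalogs, the function name translate_proto, and the "(verb)" / "(noun)" suffixes are invented for illustration; with gettext, the proto-strings would be the msgid values and English would get its own translation file:

```php
<?php
// Hypothetical catalogs keyed by proto-string. The proto-string in the code
// never changes; only the translations do, including the English one.
$catalogs = array(
    'en' => array(
        'Hi there!'    => "Hi there, y'all!", // marketing's edit lives here
        'Order (verb)' => 'Order',
        'Order (noun)' => 'Order',
    ),
    'de' => array(
        'Hi there!'    => 'Hallo!',
        'Order (verb)' => 'Bestellen',
        'Order (noun)' => 'Bestellung',
    ),
);

function translate_proto(array $catalogs, $lang, $proto)
{
    // Assumption: an untranslated proto-string is shown as-is.
    return isset($catalogs[$lang][$proto]) ? $catalogs[$lang][$proto] : $proto;
}

echo translate_proto($catalogs, 'en', 'Hi there!'), "\n";    // the edited English text
echo translate_proto($catalogs, 'de', 'Order (verb)'), "\n"; // one meaning of "Order"
echo translate_proto($catalogs, 'de', 'Order (noun)'), "\n"; // the other meaning
```

Changing the displayed English text touches only the 'en' catalog, so the German entries stay keyed by the same proto-string.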
tharkun
+3  A: 

Wow, I'm surprised that no one is advocating using the English as a key. I used this style in a couple of software projects, and IMHO it worked out pretty well. The code readability is great, and if you change an English string it becomes obvious that the message needs to be considered for re-translation (which is a good thing).

In the case that you're only correcting spelling or making some other change that definitely doesn't require translation, it's a simple matter to update the IDs for that string in the resource files.

That said, I'm currently evaluating whether or not to carry this way of doing I18N forward to a new project, so it's good to hear some thoughts on why it might not be a good idea.

dcstraw
A: 

In a word: don't do this.

The same word/phrase in English can often enough have more than one meaning, and each meaning needs a different translation.

Define mnemonic IDs for your strings, and treat English as just another language.

I agree with the other posters that ID numbers in code are a nightmare for code readability.
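
As a small illustration of one English word needing two translations, here is a sketch with mnemonic IDs in which English is just another catalog. The IDs button.open and status.open, the catalog layout, and the lookup function are hypothetical:

```php
<?php
// Hypothetical catalogs: English is treated like any other language, and
// each meaning of "Open" gets its own mnemonic ID.
$catalogs = array(
    'en' => array('button.open' => 'Open',   'status.open' => 'Open'),
    'de' => array('button.open' => 'Öffnen', 'status.open' => 'Offen'),
);

function lookup(array $catalogs, $lang, $id)
{
    // Assumption: show the raw ID when a translation is missing.
    return isset($catalogs[$lang][$id]) ? $catalogs[$lang][$id] : $id;
}

// English shows the same word twice; German needs a verb and an adjective.
echo lookup($catalogs, 'en', 'button.open'), ' / ', lookup($catalogs, 'en', 'status.open'), "\n";
echo lookup($catalogs, 'de', 'button.open'), ' / ', lookup($catalogs, 'de', 'status.open'), "\n";
```

Had the English text "Open" been the key, both German entries would have collided on a single ID.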

Ex localisation engineer