
EDIT: I would really like to see some general discussion about the formats and their pros and cons!

EDIT2: The bounty didn't really help to create the needed discussion; there are a few interesting answers, but comprehensive coverage of the topic is still missing. Six people have marked the question as a favourite, which shows me that there is interest in this discussion.

When deciding on internationalization, the toughest part IMO is the choice of storage format.

For example the Zend PHP Framework offers the following adapters which cover pretty much all my options:

  • Array : no, hard to maintain
  • CSV : don't know, possible problems with encoding
  • Gettext : frequently used, poEdit available for all platforms, BUT complicated
  • INI : don't know, possible problems with encoding
  • TBX : no clue
  • TMX : too heavyweight? no freely available editors.
  • QT : not very widespread, no free tools
  • XLIFF : the coming standard? BUT no free tools available.
  • XMLTM : no, not what I need

Basically I'm stuck with the four 'bold' choices. I would like to use INI files, but I keep reading about encoding problems... is it really a problem if I use strict UTF-8 throughout (files, connections, DB, etc.)?

I'm on Windows and I tried to figure out how poEdit works, but just didn't manage. There are no tutorials on the web either. Is gettext still a viable choice, or an endangered species anyway?

What about XLIFF, has anybody worked with it? Any tips on what tools to use?

Any ideas for Eclipse integration of any of these technologies?

A: 

You can use INI if you want; it's just that INI has no way to declare that it is in UTF-8, so if someone opens your INI file with an editor, they might corrupt it.

So it comes down to whether you can trust whoever edits the files to keep them in UTF-8.

You can add a BOM at the start of the file; some editors know about it.
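
For illustration, a minimal PHP sketch of reading such a BOM-prefixed INI file (the file name and key are my assumptions): parse_ini_file() would treat the BOM bytes as part of the first key, so the sketch strips the BOM and parses the remainder as a string.

    <?php
    // Sketch: read an INI translation file that may start with a UTF-8 BOM.
    // parse_ini_file() would treat the BOM as part of the first key, so we
    // strip it and parse from a string instead. File name is illustrative.
    $raw = file_get_contents('default.de.ini');
    if (substr($raw, 0, 3) === "\xEF\xBB\xBF") {
        $raw = substr($raw, 3); // drop the BOM
    }
    $strings = parse_ini_string($raw);
    echo $strings['pleaselogin'] ?? '';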

What do you want it to store? User-generated content or your application resources?

CiNN
I want the INI files to store the language strings. I would then have one INI file for each language and each module, like default.en, default.de, default.fr ...
tharkun
Then you can use a simple INI; you just need to state in your docs that the translation files NEED to be in UTF-8. And if a translator doesn't comply, it's their fault :)
CiNN
I did a variant of this (e.g. an INI-type file per language) and loaded it into a custom hashtable. It was fast and worked nicely, except for having to work around some home-grown OO in the C app.
torial
+10  A: 

POEdit isn't really hard to get the hang of. Just create a new .po file, then tell it to import strings from the source files. The program scans your PHP files for any function calls matching _("Text"), gettext("Text"), etc. You can even specify your own functions to look for.

You then enter a translation in the appropriate box. When you save your .po file, a .mo file is automatically generated. That's just a binary version of the translations that gettext can easily parse.

In your PHP script make a call to bindtextdomain() telling it where your .mo file is located. Now any strings passed to gettext (or the underscore function) will be translated.
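
A minimal sketch of that setup, assuming a conventional ./locale/de_DE/LC_MESSAGES/messages.mo layout (the locale, path, and domain name "messages" are illustrative, not from the answer):

    <?php
    // Hedged sketch of a PHP gettext setup; the locale, directory layout,
    // and the domain name "messages" are assumptions for illustration.
    $locale = 'de_DE.utf8';
    putenv("LC_ALL=$locale");
    setlocale(LC_ALL, $locale);

    // Expects ./locale/de_DE/LC_MESSAGES/messages.mo
    bindtextdomain('messages', __DIR__ . '/locale');
    bind_textdomain_codeset('messages', 'UTF-8');
    textdomain('messages');

    echo _("Please enter your login and password below.");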

It makes it really easy to keep your translation files up to date. POEdit also has some neat features like allowing comments, showing changed and dropped strings and allowing fuzzy matches, which means you don't have to re-translate strings that have been slightly modified.

Josh
A: 

I have worked with two of these formats on the l10n side: TMX and XLIFF. They are pretty similar. TMX is more popular nowadays, but XLIFF is gaining support quickly. There was at least one free XLIFF editor when I last looked into it, Transolution, but it is no longer being developed.

Nemanja Trifunovic
A: 

One rather simple approach is to just use a resource file and resource script. Programs like MSVC have no problem editing them, and they're reasonably friendly to other systems and to text editors. You can just create separate string tables (and bitmap tables) for each language and mark each table with the language it is in.

Brian
A: 

None of those choices looks very appetizing to me.

If you're sending files out for translation into multiple languages, then you want to be able to trust that the encodings are correct, especially if no one on your team speaks those languages. Sometimes it's difficult to spot an encoding problem in a foreign language, and it is just too easy to inadvertently corrupt file encodings if you let your OS 'guess'.

You really want a format that declares its encoding. Otherwise, translators or their translation tools might select something other than UTF-8. For my money, any kind of simple XML format is best, but it looks like you'd need to roll your own in Zend. XLIFF and TMX are certainly overkill.

A format like Java's XML resources would be ideal.
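
As a sketch of what such a self-describing format could look like in PHP (the file layout, element names, and loader are my assumptions, not an existing Zend adapter):

    <?php
    // Sketch of a minimal XML resource file plus loader; all names are
    // illustrative assumptions. Example lang/de.xml (with an XML
    // declaration stating encoding="UTF-8"):
    //   <strings>
    //     <string key="pleaselogin">Bitte Login und Passwort eingeben.</string>
    //   </strings>

    function loadStrings(string $path): array
    {
        // SimpleXML honours the encoding declared in the XML prolog.
        $xml = simplexml_load_file($path);
        $strings = [];
        foreach ($xml->string as $node) {
            $strings[(string) $node['key']] = (string) $node;
        }
        return $strings;
    }

    $t = loadStrings(__DIR__ . '/lang/de.xml');
    echo $t['pleaselogin'];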

Mike Sickler
Why would you need to roll your own? Have you used ZF?
sims
+2  A: 

There is always the Translate Toolkit, which allows converting between (I think) all the formats mentioned, as well as its preferred formats, Gettext (PO) and XLIFF.

Jakub Narębski
A: 

I do the data storage myself using a custom design: all displayed text is stored in the DB.

I have two tables. The first table has an identity value, a 32-character varchar field (indexed on this field) and a 200-character English description of the phrase.

My second table has the identity value from the first table, a language code (EN_UK, EN_US, etc.) and an NVARCHAR column for the text.

I use NVARCHAR for the text because it supports other character sets which I don't use yet.

The 32 character varchar in the first table stores something like 'pleaselogin' while the second table actually stores the full "Please enter your login and password below".
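
A hedged PDO sketch of that lookup (the table and column names are guesses based on the description, not the author's actual schema):

    <?php
    // Hedged sketch of the two-table lookup described above, using PDO.
    // Table and column names are assumptions; $db is an existing PDO
    // connection.
    function phrase(PDO $db, string $key, string $lang): string
    {
        $stmt = $db->prepare(
            'SELECT t.text
               FROM phrase p
               JOIN translation t ON t.phrase_id = p.id
              WHERE p.code = :code AND t.lang = :lang'
        );
        $stmt->execute(['code' => $key, 'lang' => $lang]);
        return (string) $stmt->fetchColumn();
    }

    echo phrase($db, 'pleaselogin', 'EN_UK');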

I have created a huge list of dynamic values which I replace at runtime. An example would be "You have {[dynamic:passworddaysremain]} days to change your password." - this allows me to work around word-ordering differences between languages.
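
A sketch of that substitution in PHP (the {[dynamic:...]} token syntax is from the answer; the helper name and sample values are illustrative):

    <?php
    // Sketch of the runtime placeholder substitution described above.
    function expandDynamic(string $text, array $values): string
    {
        return preg_replace_callback(
            '/\{\[dynamic:(\w+)\]\}/',
            // Leave unknown tokens untouched so they are easy to spot.
            fn (array $m) => (string) ($values[$m[1]] ?? $m[0]),
            $text
        );
    }

    echo expandDynamic(
        'You have {[dynamic:passworddaysremain]} days to change your password.',
        ['passworddaysremain' => 14]
    );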

I have only had to deal with Arabic numerals so far, but will have to work something out for the first user who requires non-Arabic numerals.

I actually pull this information out of the database every two hours and cache it to disk in one XML file per language, making extensive use of CDATA sections.

There are many options available; for performance you could use HTML templates for each language. My method works well, but it does use the XML DOM a lot at runtime to create the pages.

John
A: 

This might be a little different from what's been posted so far and may not be exactly what you're looking for, but I thought I would add it, if for nothing else than a different approach. I went with an object-oriented approach: I created a system that encapsulates language files into a class by storing them in an array of string => translation pairs. Access to a translation is through a translate method that takes the key string as a parameter. Extending classes inherit the parent's language array and can add to it or overwrite it. Because the classes are extensible, you can change a base class and have the changes propagate through the children, making this more maintainable than an array by itself. Plus, you only load the classes you need.
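
A minimal PHP sketch of that idea (class and key names are illustrative assumptions, not the answerer's actual code):

    <?php
    // Sketch of the extensible language-class idea described above.
    class Lang
    {
        protected array $strings = [
            'btn_save' => 'Save',
        ];

        public function translate(string $key): string
        {
            // Fall back to the key so missing entries are visible.
            return $this->strings[$key] ?? $key;
        }
    }

    class LangDe extends Lang
    {
        public function __construct()
        {
            // Children inherit the parent's array and may override entries.
            $this->strings = array_merge($this->strings, [
                'btn_save' => 'Speichern',
            ]);
        }
    }

    echo (new LangDe())->translate('btn_save'); // "Speichern"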

VirtuosiMedia
A: 

We just store the strings in the DB and have a translator mode built into the application to handle actually adding strings for different languages.

In the application we use various tricks to create text ids, like

£("btn_save")
£(Order.class,"amt")

The translations are loaded from the DB when the system boots, or when a reload is manually triggered. The £ method takes care of looking up the translated string according to the language specified in the user session.
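
The answer's snippets look like Java rather than PHP; a hedged PHP sketch of the same idea (all names here are assumptions) might be:

    <?php
    // Sketch of the boot-time cache plus session-language lookup.
    session_start();

    // Filled once at boot (or on manual reload) from the DB:
    // [languageCode][textId] => translated string.
    $translations = [
        'en' => ['btn_save' => 'Save'],
        'sv' => ['btn_save' => 'Spara'],
    ];

    function t(string $textId): string
    {
        global $translations;
        $lang = $_SESSION['lang'] ?? 'en';
        return $translations[$lang][$textId] ?? $textId;
    }

    echo t('btn_save');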

John Nilsson
A: 

Hello,

You can check out my l10n tool, iL10Nz, at http://www.myl10n.net

You can upload PO/POT, XLIFF, and INI files, translate them, and download the results.

You can also check out this video on YouTube: http://www.youtube.com/watch?v=LJLmxMFxaxA

Thanks, Olivier