Context:

I'm in the design phase of what I hope will be a big website (lots of traffic, lots of users reading from and writing to the database). I want to offer this website in the three languages I speak myself: English, French and, by the time I finish the website, hopefully enough Spanish to offer that too.

Dilemma:

I'm wondering how I should go about offering these various languages (and perhaps more in the future).

Criteria:

Many methods exist for designing multi-language websites. I'm looking for the technique that will result in a faster browsing experience for the user.

Choices:

Currently, I can think of (and have read about) the following choices. They are sorted in order of preference up to now.

  1. Store all language-specific strings in a database and fetch the right one depending on the preferred language (members can choose which language they prefer), the browser's default language and the language selected during the current session, in that order.

    Pros:

    • Most of the time, a single test at the beginning of the session confirms which language to use for the remainder of the session (stored in a SESSION variable). Otherwise, a user logging in also fetches the right language and keeps it until he/she logs out (no further tests). So the testing part should be pretty fast.

    Cons:

    • I'm afraid that accessing the database all the time would be quite time-consuming (longer page load for the user), especially considering that lots of users could be accessing the database at the same time, both for the same reason (getting the website text in the correct language) and for posting comments and such.
    • Strings which include variables (e.g. "Hello " + user.name + ", how are you?") are harder to store because the variable (e.g. user name) changes for each user.
    • A direct link to a portal for a specific language would be ugly (e.g. www.site.com?lang=es)
  2. Store all language-specific strings in a text file and fetch the right one depending on the preferred language (members can choose which language they prefer), the browser's default language and the language selected during the current session, in that order.

    Pros:

    • Most of the time, a single test at the beginning of the session confirms which language to use for the remainder of the session (stored in a SESSION variable). Otherwise, a user logging in also fetches the right language and keeps it until he/she logs out (no further tests). So the testing part should be pretty fast.

    Cons:

    • I'm afraid that accessing the text file all the time would be quite time-consuming (longer page load for the user), especially considering that lots of users could also be accessing the file at the same time for the same reason (getting the website text in the correct language).
    • Strings which include variables (e.g. "Hello " + user.name + ", how are you?") are harder to store because the variable (e.g. user name) changes for each user.
    • I don't think multiple users could access the text file concurrently, though I may be wrong. If that's the case though, every user loading a page would have to wait for his/her turn to access the text file.
    • Fetching the very last string of the text file could be pretty long...
    • A direct link to a portal for a specific language would be ugly (e.g. www.site.com?lang=es)
  3. Create multiple versions of the website in separate folders, where each version is in a different language.

    Pros:

    • No extra processing is needed for handling languages, so no extra waiting time.

    Cons:

    • Maintaining the website will be like going to school: painful, long, makes you stupid after doing the same thing over and over again.
    • Ugly URL (e.g. www.site.com/es/ instead of www.site.com)

Additionally, the choices above could be combined with one or more of the following techniques:

  1. Caching certain frequently requested pages (in a singleton or static PHP function?). Certain sentences could also be cached for every language.

    Pros

    • Quicker access for frequently-requested pages.
    • Which pages need caching can be determined dynamically, with time.

    Cons

    • I'm not sure about this one, but would this end up bloating the server's RAM?
  2. Rewriting the URL could be used for many things.

    • A user looking for direct access to one language could do so using www.site.com/fr/somefile and would be redirected to www.site.com/somefile, with the selected language being stored in a session variable.

    Pros

    • Search engines like this because they have two different pages to show for two different languages

    Cons

    • Bookmarking a page doesn't mean you'll end up with the right language when you come back, unless I put the language information in the URL (www.site.com/somefile?lang=fr)

A little more info

I usually use the following technologies to make a website:

  • PHP
  • SQL
  • XHTML
  • CSS
  • Javascript (and AJAX)

This being said, if a solution requires that I learn a new language or something, I'm very open to doing so. I have no deadline for this project and I do intend to learn a lot from doing it!

Conclusion:

What I'm looking for is a method that lets me offer multiple languages without increasing page load time and without going crazy when maintaining the website. If you have other ideas I should consider, I will try adding them to my list. Another possibility is that I'm overdoing this. Maybe these methods won't save enough time to be worth it; I just don't know how to verify whether I need to worry about this. If you have any ideas for that, it would also help me.

A: 

"I'm afraid that accessing [the database/text file] all the time would be quite time-consuming"

It would be, but that's why you'd likely be using caching to some extent. Nearly all large sites are accessing data stored outside the HTML page itself and, as such, utilize caching techniques as needed.

Your question regarding speed really is irrelevant to having multiple languages. It's an issue of storing data (content) so it's easy to maintain and present to the user. Whether it's one language or 10 the problem is the same.

DA
I think it's different from having any other form of content, because each change I make needs to be reflected in all languages. As for the caching techniques, are there any links you could give me in this regard? And where does this cache reside? On the client's computer? On the server?
Shawn
Having to reflect the change in all languages isn't a performance issue on the front end but rather a maintenance issue for your team on the back end. Caching can happen anywhere, but for your purposes, it'd be on the server. How you'd do it depends on a variety of things, including the particular server you run, the language you wrote the app in, and the database you are using. Some examples would be your DB caching common queries in memory, such as the request for the home page data, and your web server caching pre-rendered pages that are frequently requested.
DA
Well, I don't know what the server will offer, but it will most probably be my friend's server, so I guess I could ask him to add what I'm missing... Other than that, I'm willing to learn what I need to get the best results... I was thinking along the following lines: MySQL and PHP (and a lot of AJAX). Is that a good choice? Will I get a slower response because of the interpreted nature of PHP?
Shawn
A: 

Create the most generic form of the site that you can. Import the translations from a database, with fallback: define an order of languages, and if a translation does not exist, use the next best language (for German: German, Dutch, English, etc.).
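A minimal sketch of such a fallback lookup, assuming the translations have already been read into nested maps (language code, then message key). The class name `FallbackTranslations` and the choice to return the key itself when nothing matches are illustrative assumptions, not something the answer prescribes.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: tries each language of a fallback chain in turn.
final class FallbackTranslations {
    // language code -> (message key -> translated text);
    // in a real site this would be filled from the database.
    private final Map<String, Map<String, String>> byLanguage;

    FallbackTranslations(Map<String, Map<String, String>> byLanguage) {
        this.byLanguage = byLanguage;
    }

    // Returns the first translation found along the chain,
    // or the key itself if no language in the chain has one.
    String translate(String key, List<String> fallbackChain) {
        for (String lang : fallbackChain) {
            Map<String, String> table = byLanguage.get(lang);
            if (table != null && table.containsKey(key)) {
                return table.get(key);
            }
        }
        return key;
    }
}
```

For a German visitor the chain would be something like `List.of("de", "nl", "en")`, so a string that only exists in English is still shown rather than left blank.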

You would solve performance issues by keeping caches of the dynamically created pages. [Check the dependent data and update if necessary]

The user's preferred language is passed along in the HTTP request headers (Accept-Language). A language selector plus a query string is then often unnecessary.
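As a rough illustration, the header can be matched against the languages a site offers using the JDK's own `Locale.LanguageRange.parse` and `Locale.lookup`. The `AcceptLanguage` class and the list of supported languages (taken from the question: English, French, Spanish) are assumptions for this sketch.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper that picks a site language from an Accept-Language header.
final class AcceptLanguage {
    // Languages the site offers (assumed here: English, French, Spanish).
    static final List<Locale> SUPPORTED = List.of(
            Locale.ENGLISH, Locale.FRENCH, Locale.forLanguageTag("es"));

    static Locale choose(String header, Locale fallback) {
        if (header == null || header.isBlank()) {
            return fallback;
        }
        try {
            // parse() orders the ranges by their q-weights, highest first.
            List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(header);
            // lookup() returns null when none of the ranges matches a supported locale.
            Locale match = Locale.lookup(ranges, SUPPORTED);
            return match != null ? match : fallback;
        } catch (IllegalArgumentException malformedHeader) {
            return fallback;
        }
    }
}
```

A header such as `fr-CA,fr;q=0.8,en;q=0.5` resolves to French here, because the lookup truncates `fr-CA` to `fr` when no exact match exists.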

Resource files would be one way to go. They are easier to send to translators. However, they can be difficult to reuse across multiple websites.

Databases are convenient because the database is the first thing that should be backed up on a website. They also have the benefit of being fast. However, if you have an extremely database-focused project, you may not want to put additional strain on your database.

monksy
That caching thing looks interesting. Any links you could give me? Also, what is this cache? The client's cache? The server's RAM? And doesn't the server automatically do this type of optimization?
Shawn
The server does not do this kind of caching automatically; you have to get a caching solution or set one up manually. The cache is located on the server side.
monksy
Does this mean I have to alter the server's settings or something? I don't know if I will be allowed to...
Shawn
There are a few solutions that do server side extensions, a few solutions that do caching on the logic level, and the rest would be coded by yourself.
monksy
Which of these solutions should I pick, or rather, what's the difference between them?
Shawn
A: 

For my solutions I want this:

  • The language should be indicated in the URL; it works better with Google indexing the page and with people following the links in Google's search results.
  • As many pre-generated translations as possible, for faster page serving.

The first is quite easily done by having an URL like http://example.com/fr/and-so-on. URL rewriting can turn that into http://example.com/and-so-on?lang=fr which is potentially easier to handle.

For pre-generating translations, it is good to use an HTML template framework so you can generate translated templates from one set of source templates. A blunt approach is to generate a sed script from a language key-value file, and run that sed script on each template to get a translated version.
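The same idea can be sketched in Java instead of sed: one source template, one key-value table per language, and a build step that writes out a translated copy. The `{{key}}` placeholder syntax and the `TemplateTranslator` name are assumptions made for this example; the answer does not specify how keys are marked in the templates.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical build-time helper: fills {{key}} placeholders from a language table.
final class TemplateTranslator {
    // Placeholder syntax is an assumption for this sketch.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\{\\{([A-Za-z0-9_.]+)\\}\\}");

    static String translate(String sourceTemplate, Map<String, String> strings) {
        Matcher m = PLACEHOLDER.matcher(sourceTemplate);
        return m.replaceAll(match -> {
            String value = strings.get(match.group(1));
            // Unknown keys are left as-is so missing translations stay visible.
            // quoteReplacement keeps '$' and '\' in translations literal.
            return Matcher.quoteReplacement(value != null ? value : match.group());
        });
    }
}
```

Running this once per language at deploy time yields static, already-translated templates, so only the dynamically generated parts need translating at request time.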

What remains then is to translate the dynamically generated parts of the pages. There are a few tools for that: Java has resource bundles, and GNU gettext is quite a nice tool.

Christian
I don't quite understand what you mean by HTML template frameworks. Do you mean that each page has a version in all languages and I just serve the right one? That is, are you suggesting I go with choice #3? Or did I not understand what you're saying? As for translation, I want to do it all myself. I really hate the translator tools; they don't translate style, metaphors, etc.
Shawn
+1  A: 

Whether you use a database or a filesystem to store the translations, you should be loading the text all at once and then serving it from memory. Most applications will typically not have so much text that this becomes a problem. In Java or .Net this could be accomplished by storing the text in a singleton or static object. Then all the strings are in RAM and do not need to be loaded or parsed. If your platform does not have a convenient way to store data in ram, you could run a separate caching application such as memcached.
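A minimal sketch of the singleton approach, assuming a caller-supplied `loader` function that reads every string for one language from a file or the database (that loader, and the `TranslationCache` name, are hypothetical). This variant loads each language the first time it is requested rather than everything at startup, which is a small deviation from "all at once".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical in-memory store: each language is loaded once, then served from RAM.
final class TranslationCache {
    private static final TranslationCache INSTANCE = new TranslationCache();

    // language code -> (message key -> translated text)
    private final Map<String, Map<String, String>> loaded = new ConcurrentHashMap<>();

    private TranslationCache() {}

    static TranslationCache getInstance() {
        return INSTANCE;
    }

    // loader is only invoked the first time a language is requested.
    String get(String lang, String key, Function<String, Map<String, String>> loader) {
        Map<String, String> table = loaded.computeIfAbsent(lang, loader);
        if (table == null) {
            return key; // loader found nothing for this language
        }
        return table.getOrDefault(key, key);
    }
}
```

Because callers only see `get(lang, key, ...)`, switching the loader from files to a database later does not affect the rest of the code.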

The rest of your concerns can be mitigated by hiding the details. Build or find a framework that lets you load your translations and then look them up by some key. If you decide to switch to files or a database later, the rest of your code is unaffected. In the short term do whichever is easier for you. I've found that it's best to have a mix: it's easier to manage application text along with the source code in a version control system. But some text changes often, or needs to change without requiring a build+deployment cycle, and that text should be in the DB.

Finally, don't build strings by concatenating fragments around variables. Use some kind of format string, because otherwise your translators will go crazy trying to translate sentence fragments.

(Warning: Java code sample)

//WRONG
String msg = "Hello, " + username + ", welcome back.";

//RIGHT
String fmt = "Hello, %s, welcome back."; // in real code: load this string from a file or the db
String msg = String.format(fmt, username); // String.format is static: pass the format string first

Another person mentioned encoding the language in the URL. This is the preferred way to do it if you care what a search engine thinks of your site. Google recommends using different hostnames or a different subdirectory. This means that the language headers sent by the user can't be used for anything, except perhaps initially sending them to one landing page or another. You will need to determine the language for each request based on the incoming URL (this actually simplifies your code a lot later on). In Java I'd store the language code in the Request and just grab it whenever I need it.

The easiest way to handle language codes in the URL is to use rewriting. A client sends a request for www.yoursite.com/de/somepage and internally you rewrite the request to www.yoursite.com/somepage and store the language identifier somewhere. In Java each request has an HttpServletRequest object where you can store attributes for the lifecycle of the request. If your framework doesn't have anything like that you can just add a parameter to the URL: www.yoursite.com/de/somepage => www.yoursite.com/somepage?lang=de.

If you are using hostname-based languages you can use hostnames such as de.yoursite.com or www.yoursite.de. There are pros and cons to this approach. For one thing, using country-code TLDs means registering a domain under each TLD and trying to figure out whether a country code is appropriate to represent a language (it's often not). Using different hostnames/domains means you have to consider under which domains cookies are stored. If you want a cookie-free subdomain you need to plan this carefully. But from the coding side a language-based hostname doesn't need any additional rewriting; you can read the hostname (it's the Host header in the HTTP request) and parse that to determine the language.
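Both variants can be sketched with plain string handling, independent of any servlet API. The `LanguageRoute` class, its field names and the set of supported codes are assumptions for illustration; a real site would more likely do the path rewrite in the web server or a request filter.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: extracts a language code from a request path or hostname.
final class LanguageRoute {
    // Assumed set of language codes the site offers.
    static final Set<String> SUPPORTED = Set.of("en", "fr", "es", "de");

    final String language; // null when the path carries no language prefix
    final String path;     // path with the language prefix removed

    private LanguageRoute(String language, String path) {
        this.language = language;
        this.path = path;
    }

    // "/de/somepage" -> language "de", path "/somepage"
    static LanguageRoute fromPath(String requestPath) {
        if (requestPath.startsWith("/")) {
            int next = requestPath.indexOf('/', 1);
            String first = next < 0
                    ? requestPath.substring(1)
                    : requestPath.substring(1, next);
            if (SUPPORTED.contains(first)) {
                String rest = next < 0 ? "/" : requestPath.substring(next);
                return new LanguageRoute(first, rest);
            }
        }
        return new LanguageRoute(null, requestPath);
    }

    // "de.yoursite.com" -> "de"; null when the first label is not a supported code
    static String fromHost(String hostHeader) {
        int dot = hostHeader.indexOf('.');
        if (dot < 0) {
            return null;
        }
        String label = hostHeader.substring(0, dot).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(label) ? label : null;
    }
}
```

Paths without a recognised prefix come back unchanged with a null language, so the caller can fall back to another way of choosing the language.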

Mr. Shiny and New
You say I should load the text all at once and then serve it from memory. Where would this memory be? Is this the server's cache? Is this the client's cache? That's indeed what I wanted to do, but I thought the only "memory" to which I had access was either a DB, a text file or simply an in-source-code array... As for having a mix of the choices I present, that's probably what I will end up with... Next you suggest using a format string... You mean building a function similar to C++'s printf? As for the Google recommendation, they prefer I use choice #3, is that what you're saying?
Shawn
@Shawn: I edited my answer to address your points.
Mr. Shiny and New
Thanks very much. I think I understand now. I edited my post to reflect this :p But there are a few things I'm not sure about. If I store all this info in the server's RAM, won't I be bloating the RAM, thus effectively slowing the server down? And you say storing in a Java singleton or static object. Can this be done in PHP?
Shawn
Yes, this does use up a lot of RAM, however a server configured for high load will typically have lots of RAM and be dedicated to this one task, so that isn't usually a problem. It's the classic space-time trade-off. Use more RAM to get more speed.
Mr. Shiny and New
As for static objects in PHP, I'm not sure how that can be done. In Java the application server is typically running for a long time. In PHP that's not the case; individual worker processes come and go. You need some long-running process to keep the data in RAM. Memcached can do this if you can't find a native PHP way. With memcached you first check the cache server for a value, and if it's not there you get it from the primary storage such as a disk.
Mr. Shiny and New
Thanks a lot mate, I just looked up PHP's APC and it seems to be exactly what I want as far as caching goes. I was just wondering, what do you think would be the maximum amount of data I could cache before being a nuisance to the server? I'm guessing that using more than 50% of the RAM would start getting nasty... I don't ask this concerning my application, which probably won't come anywhere close to 50% of the RAM, just for general knowledge.
Shawn
@Shawn: A web server that is part of a high-performance website should probably be dedicated to that task. So if you have unused RAM, that RAM is wasted. The application I maintain has hundreds of MB of cached objects in each application instance.
Mr. Shiny and New
+1  A: 
  • Offer the initial page in a language depending on the Accept-Language HTTP header.

  • Let the user set the language in the current session and, if they're authenticated, in their user profile.

  • In your code and templates, mark strings as "translatable." You should have tools that gather all the strings from your codebase and let your translators translate them.

  • Have a layer which loads the translations from the database either individually or as a bundle, and apply them to the page which is loading. Cache these parts to make them fast -- every page load shouldn't make a hundred calls to the database for every translatable string.
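The first two points amount to a precedence rule, which might look like the sketch below. The order used here (explicit session choice, then saved profile preference, then the Accept-Language result, then a site default) is an assumption; the list above does not fix one, and `LanguageResolver` is a hypothetical name.

```java
import java.util.Objects;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical resolver: picks the first supported candidate in priority order.
final class LanguageResolver {
    static String resolve(Set<String> supported,
                          String siteDefault,
                          String sessionChoice,      // set by the user during this session, may be null
                          String profilePreference,  // saved for authenticated users, may be null
                          String headerLanguage) {   // derived from Accept-Language, may be null
        return Stream.of(sessionChoice, profilePreference, headerLanguage)
                .filter(Objects::nonNull)      // skip sources that have no value
                .filter(supported::contains)   // skip languages the site does not offer
                .findFirst()
                .orElse(siteDefault);
    }
}
```

Resolving once per request and storing the result keeps the rest of the code free of language-detection logic.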

Check out how Django does it -- it should be enlightening.

a paid nerd
Thanks, that Django link was very interesting.
Shawn