views:

121

answers:

6

So far there've been several questions regarding this, and they've all come down to the same answer: one table for the language-neutral data, 1-* to a table with the translations and an indexed language ID field.

This has several problems:

  1. Twice as much CRUD.
  2. Need for Ajax CRUD if you want a decently friendly web UI.
  3. More than twice the validation -- you need to ensure that the relationship is 1-* rather than 0-*.
  4. Collation differences between languages isn't accommodated.
  5. Queries require joins.
  6. If you want slugs in multiple languages, oh boy.

A lot of database people have worked on all sorts of theoretical and practical problems, but surprisingly few people work on this one.

I think what we need ultimately is:

  • A field type that'll store multiple versions of strings
  • Multiple indices for each such field, one for each language or variation, with the option to specify the correct collation mode
  • A standard ORM object for this crazy thing
  • UI elements

Overkill? Sure, maybe, but the whole problem is a real nightmare as it is. And it's not exactly an uncommon scenario.

We gotta try to convince server vendors to work on this.

Edit: By the way, this is my first time using the community wiki; hopefully I'm doing it right.

Edit 2: Something about my wording seems to have made people think that I'm attacking the very concept of DBMS. I'm not; I'm simply saying that built-in support for localization is a much-needed feature.

I probably shouldn't have mentioned performance; it's of course completely negligible most of the time. The focus of my concern is on the fact that this really stifles productivity.

I'll provide an example. Suppose I have a very trivial table for a decidedly trivial store:

Products (id, price, description, name, slug)

In EF/MVC, I'd throw this in the ORM designer, maybe encapsulate it in a repository, build a Products controller, and have actions for Index, Details, Create, Update, Edit and Delete. To identify a product in any of the items, I'd simply do a WHERE(slug = @slug). I'd make a view model for the create/edit actions, design the form control, and wire it up straight to the repository. Done and done. To access the details for a product, the user would go to /products/details/product-slug.

But then since the rest of the website is bilingual, I decide to change the products table accordingly.

Products (id, price)
ProductsText (productId, language, description, name, slug)

Hey, that's not so bad. Yeah, not yet. Then you write your relationships and your constraints, and then you write you write out all your properties in the view-model, and then you make a complete CRUD controller for the ProductsText data or use jQuery/Ajax to add create/update/edit buttons on your Products controller, and then you add validation logic to make sure the user enters at least the primary language, and then when you want to read data for the end-user pages you write another query to take join ProductsText.slug and ProductsText.language with Products... I probably missed something, but you get the idea.

The complexity of the program just explodes with boilerplate code once you have localization involved.

Of course, I don't expect the problem to be solved completely, and it's obviously just as much a UI problem as it is a database problem. But there's just so much that could be done to make all this easier. A "multistring" field type might be a really good start.

Edit 3: Anyone ever hear of SQL Server Modeling Services? It has some localization tools in it that could be a step in the right direction. Still CTP though.

-- Simulate the French locale with the SET LANGUAGE statement.
SET LANGUAGE French
select Id, CountryName, 
   [System.Globalization].[SessionsString](CountryName, 1) as CountryNameString
from [Location].[CountriesTable]
+3  A: 

What is a localized database field?

Typically in applications we've worked in, the UI is localized. This is accomplished using a database, and we put all the translations (and potentially the master phrases) in the table with a locale-code and phraseid being the primary key. This is fairly straightforward, requires a single reusable set of stored procs and has good performance and the usage is well-understood. We often allow translation on the fly so that the app interface includes a translation feature where corrections can be made and other users will see them live - either rich forms applications or web forms applications (depending on caching - which is another key feature of UI localization)

As far as querying requiring joins - that's just a fact of life in a normalized relational database, and performance there is usually managed with a good normalized design and proper indexing.

In other "data", it has made little sense to localize except under direction of the application requirements. For instance, even though you may offer a product in multiple countries, the SKU and distributor may be different. This level of localization is very application specific and we often dealt with it as a separate database and there really isn't anything tying those individually country database together - many products were not available although there may have been equivalent products in the other countries.

If you are selling the same products around the world, then you kind of fall into the original scenario in a kind of multi-lingual CMS. This requires significant work besides the low-level database. For instance, if someone corrects the default product description, what flags the translators that the translations need to also be corrected? These questions are non-trivial. Although I can see where database vendors could assist with features, these are intrinsic difficulties of application requirements and design and not necessarily something the database can add features which will universally solve.

The collation issue is indeed a little awkward. Typically data is stored in nvarchar and you would not know the collation you wanted for retrieval at the time you wrote the stored proc, since the locale would be a parameter. This only affects collections retrieved which need to be ordered by content, not usually natural key and certainly not retrieval by key - it's not a large problem, but is one which cannot easily be handled without dynamic SQL (casting using the preferred collation from a table depending upon the location passed in, if you mix data from different locales, you would have to decide if you want to sort by locale first and then it may be difficult to pick a collation which might work properly within all locales in the same result set). You are probably going to want to use a Windows collation with such a wide variety of data.

Similarly with ORMs, we typically treated the composite unique key of locale/phraseid as the key to retrieve objects (we typically also had a surrogate identity primary key) - I know that traditional ORMs don't necessarily like this departure from retrieval by a meaningless surrogate key.

Cade Roux
Yeah, I understand that a translator needs to be informed when one language changes and all that. More difficult is the fact that a translator needs to be available when that happens; professional translators aren't cheap. On the other hand there's some room for lenience there, and it's something that can be handled by someone other than the techie (i.e. me). And I'm not suggesting that all this is impossible to do or that joins are such a terrible problem, but it's just one of those yucky common boilerplate things that you'd think someone would have abstracted by now.
Rei Miyasaka
A: 

In my experience I haven't commonly run into the problem where the data stored in the database has many language-dependent versions of the same text. Typically a developed application will have many language files for all the text that's more or less statically built into the application. Then we see database data for text users enter. While an application may be used by users with many different languages, the situation where users type the same text in multiple languages is not so common. Typically uses of an application will show the UI in their language and then enter and view data in their language.

For example, users of our application in the US vs in Netherlands or Saudi Arabia would see the UI in the language of their choice, but for any given installation, the data they enter will consistently be in their native language.

Obviously this doesn't apply to all cases. CRMs are an example where you would have the same text with multiple translations, like Wikipedia, but I think what I described above is the more common scenario.

Sam
I must be stuck in no-man's-land then. Our content itself is multilingual, not just the website UI. Never thought of that as being all that uncommon, especially in bilingual communities. Eastern Canada has this problem a lot.
Rei Miyasaka
+4  A: 

I've encountered all of these issues for localized CRM-style web sites. Not fun to design and optimize, but it can be done. My 2¢ worth:

1. Twice as much CRUD.

This depends on how your CRUD is designed. Any of my stored procedures or functions that can retrieve a possibly-localized field take a locale/culture code parameter. All of these fields are also NVARCHAR to avoid encoding issues.

2. Need for Ajax CRUD if you want a decently friendly web UI.

I suppose so, but this is application-dependent. Should defer to the "internal" CRUD (DRY principle).

3. More than twice the validation -- you need to ensure that the relationship is 1-* rather than 0-*.

This also assumes that all content is required in all supported locales, instead of using a fallback mechanism. For example, Microsoft's MSDN content is available in multiple locales, but some is in only one (generally this is US English, the "neutral" locale for Microsoft).

For a CRM-style system, any locale can be used for the initial content as long as the fallback uses that if the neutral content is not available.

4. Collation differences between languages isn't accommodated.

I find that it is easier to put all collation support at the UI/reporting layer. Multilingual-aware tables with collation/locale specified on a row-by-row basis would be a very nice-to-have feature but I wouldn't like to wait for it to become available...

5. Queries require joins.

Yes, definitely makes the query a bit more complicated :-) but no real way around that. Can get even more complicated if locale fallback is included (a "locale specificity" ranking field helps here).

6. If you want slugs in multiple languages, oh boy.

This is the reason that the .NET replacement parameters in the format string were designed to be indexed, not positional (printf(), etc. are positional). An English format may need replacements in 1, 2, 3 order, while the German equivalent uses 3, 1, 2.

To make life easier for localizers, whenever I create a .NET resource bundle I document the parameters including index, data type (including minimum and/or maximum string lengths), and a contextual description - context is important for determining text gender in some locales.

Plurality may also require multiple related resources as some locales need more than just "single" and "plural" (e.g. "0 files", "1 file", "2 files").

The same rules must apply to any localizable column in the database.

devstuff
+1  A: 

Well the answers are not that helpfull so far. I had the same problem on various projects I was doing in the past. And there was never a shortcut nor a solution out of the box that helpped me to solve this problem in a easy way. But your approach is going into the right direction and with a little work on your Data Access Layer you can actualy abstract all the burden that is caused by this requirement.

So for Metadata like Types, Categories, Countries etc. performance is not an issue since the whole stuff can be cached. For freetext entries it is a different story. You most probably can't cache them and they tend to be quite long.

You might already know those pages:

http://www.codeproject.com/KB/aspnet/LocalizedSamplePart2.aspx http://www.sisulizer.com/online-help/DatabaseLocalization.shtml http://stackoverflow.com/questions/258483/best-practices-for-localizing-a-sql-server-2005-2008-database

Yves M.
Actually I hadn't seen some of those pages. Thanks for that, much appreciated.Yeah, I sorta listed the performance issue since it *does* technically negligibly impede performance, but in retrospect it might have been a bad idea to mention it.
Rei Miyasaka
"but in retrospect it might have been a bad idea to mention it." Performance is always something one has to think about when doing the desing ;-)
Yves M.
A: 

"A lot of database people have worked on all sorts of theoretical and practical problems, but surprisingly few people work on this one."

That's because there is nothing to work on, from a theoretical perspective, in your example. The so-called "problems" you mention are, all of them, nothing more than a direct consequence of the fact that you are managing more data.

"Twice as much CRUD."

And why is that a problem ? I know of at least a few systems I built that had a lot more of that than your example.

"Need for Ajax CRUD if you want a decently friendly web UI."

Is that really so ? I don't know, but at any rate how data is handled in the presentation layer, is no concern of the DBMS, and if the programmer thinks it is too difficult/cumbersome, then don't blame the DBMS for that.

More than twice the validation -- you need to ensure that the relationship is 1-* rather than 0-*.

And why is that a problem ? If more business rules are stated, more validation is required.

"Collation differences between languages isn't accommodated."

How so ? What is the sense of collating English text with French ? Of English text with Ukrainian or Russian or Chinese ? Or did you mean something else ?

"Queries require joins."

And why is that a problem ?

"If you want slugs in multiple languages, oh boy."

In what context ? For what purpose ?

SELECT language,nllabel FROM ... NATURAL JOIN (SELECT 'EN' as language UNION SELECT 'FR' as language)

Oh but wait, I forgot ... JOINs are also a problem.

Erwin Smout
There's a reason I say *practical* -- obviously none of this is an issue theoretically, but it's a practical problem in that you're writing a lot of boilerplate code for a common scenario. A lot of people could benefit from this being simplified. When you end up having to make extra validation, extra CRUD and an extra layer of object abstraction for an otherwise completely trivial table, you end up wasting lots of time. I don't know why you're being so defensive about DBMS. Are you afraid that additional features are going to pollute the theoretical purity of SQL? As if it had any left?
Rei Miyasaka
+1 to Erwin. Who downvoted this answer?
Mark Bannister
I did, after he posted a second answer rather than responding here. I was going to vote down his second one, but it was a lot more level-headed (even though he seems to have changed his opinion half way through), so I voted this one down instead.This post misses the point completely.
Rei Miyasaka
A: 

"and it's obviously just as much a UI problem as it is a database problem."

I disagree that it is. When looking at your problem from a database angle, there are two things that might possibly be a small beginning of a solution :

the possibility to do full view updating (both through JOIN and through GROUP, for your case). the possibility to have attributes of type 'table' inside database tables. You could then have the entire set of applicable localized names-stuff as a sinle attribute in a single row for your product/...

As for full view updating : don't hold your breath. You'll suffocate long before it has arrived.

As for nested tables : they might already exist, if anyone has them Oracle will, I don't really know, but I'm not really confident that this will really make life easier on the UI side of things.

Oh, and BTW : SQL is nowhere near "theoretically pure".

Erwin Smout
It helps to read. I never said that it's theoretically pure. It also helps to follow the posting guidelines.You do make a point though, if you could have a key-value-pair type inside a single field with multiple indices, that could be a better abstraction than multiple strings in one field. Either way, you'd still be dealing with the fact that you need a UI to create/delete/update each of those pairs; you couldn't just use a simple text box.But you do also say you don't think it'll make UI easier, despite disagreeing that it's also a UI problem... so I don't know what you're trying to say.
Rei Miyasaka