views:

125

answers:

1

Well, here is a small problem.

I am to design a web application database to use what is, basically, one spoken language. However this language has two alphabets (Latin & Cyrillic) and three spoken variants that actually have only some minor differences (I think there is several hundred words that are slightly different). To make matters more difficult, database and administration interface will be used for two applications, one of which will use both scripts and two spoken variants for all articles (as well as interface) and other will use only one (Latin).

I wrote an algorithm for latin-to-cyrillic-and-back "translation", that should make an input easier to user, but now I am facing some database/application normalization issues I am in doubt about.

(1) I could treat this as a spoken language problem and create database accordingly.

| 'variant' | [id, name] #language variant
| 'article' | [id, <all variant non-dependent fields>] # such as image, relations, etc
| 'article_variant' | [id, fk_language_variant_id, fk_article_id, <all variant dependent fields>] #title, body, etc
| 'article_has_variants' | [id, fk_article_id, fk_article_variant_id]

However due to fact that most of the database entries in 'article_variant' would contain same information it seems redundant.

(2) I could do it using 'article' table and word pairs with "latin-to-cyrillic-and-back" algorithm in order to display it to user solving it on application level. Then my database would look something like this:

| 'variant' | [id, name] #language variant
| 'article' | id, <all fields>
| 'words' | id, word
| 'word_pairs' | id, fk_word_id, fk_word_id

This solution looks perfect to me, but I am worried about impact it would have on application performance due to fact (a) it has to go trough an entire article every time it is displayed on language variant different than default for each word pair and (b) after that it has to convert each letter of the alphabet (+few more times due to complex letter conversion such are nj -> њ ) to it's pair.

Any thoughts?

+1  A: 

i'd use a dynamic option with caching. Store only one variant of the article (the one is was initially created) and create other variants on the fly , like

 article = select from articles
 if article.lang != requested_lang
     text = select from cache where id=article.id and lang=requested_lang
     if !text
          text = convert_language(article.text, requested_lang)
          insert into cache(article.id, requested_lang, text
     echo text
 else
    echo article.text

(this implies that language conversions are fully automatic, i'm not sure if this is the case)

stereofrog
Yes, since it's, in essence, the same language (with variants that can be expressed as limited number of word pairs: istorija = historija / hemija = kemija), language conversions should be automatic automatic.Thank you for extending option (2) with caching. That should deal with most of the performace issues.
Krule