I'm trying to come up with a way to estimate the number of English words a translation from Japanese will turn into. Japanese has three main scripts -- Kanji, Hiragana, and Katakana -- and each has a different average character-to-word ratio (Kanji being the lowest, Katakana the highest).

Examples:

  • computer: コンピュータ (Katakana: 6 characters); 計算機 (Kanji: 3 characters)
  • whale: くじら (Hiragana: 3 characters); 鯨 (Kanji: 1 character)

As data, I have a large glossary of Japanese words and their English translations, and a fairly large corpus of matched Japanese source documents and their English translations. I want to come up with a formula that will count numbers of Kanji, Hiragana, and Katakana characters in a source text, and estimate the number of English words this is likely to turn into.

A: 

It seems simple enough - you just need to find out the ratios.

For each script, count the number of characters in that script and the corresponding number of English words in your glossary, and work out the ratio.

This can be augmented with the Japanese source documents, assuming you can detect both which script a Japanese word is in and what its English equivalent phrase is in the translation. Otherwise you'll have to guesstimate the ratios or ignore this as source data.

Then, as you say, count the number of characters in each script of your source text, do the multiplications, and you should have a rough estimate.
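A minimal sketch of that per-script ratio approach, assuming the glossary is available as (japanese_word, english_translation) pairs; the script detection via unicodedata and all the helper names are illustrative, not anything specified in this answer:

    import unicodedata
    from collections import defaultdict

    def script_of(ch):
        # Classify a single character by its Unicode name.
        name = unicodedata.name(ch, "")
        if "CJK UNIFIED" in name:
            return "Kanji"
        if "HIRAGANA" in name:
            return "Hiragana"
        if "KATAKANA" in name:
            return "Katakana"
        return "Other"

    def fit_ratios(glossary):
        # glossary: iterable of (japanese_word, english_translation) pairs.
        # Returns {script: English words per character}.
        chars, words = defaultdict(int), defaultdict(int)
        for ja, en in glossary:
            scripts = {script_of(c) for c in ja}
            if len(scripts) != 1:       # skip mixed-script entries for simplicity
                continue
            script = scripts.pop()
            chars[script] += len(ja)
            words[script] += len(en.split())
        return {s: words[s] / chars[s] for s in chars if chars[s]}

    def estimate_words(text, ratios):
        # Count characters per script in the source text and apply the ratios.
        counts = defaultdict(int)
        for c in text:
            counts[script_of(c)] += 1
        return sum(counts[s] * ratios.get(s, 0) for s in counts)

Glossary entries that mix scripts are simply skipped here; a fuller version could split those entries per character before accumulating the counts.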

paxdiablo
+1  A: 

Well, it's a little more complex than just comparing the number of characters in a noun with its English equivalent. For instance, Japanese also has a different grammatical structure from English, so certain sentences would use MORE words in Japanese, and others would use FEWER. I don't really know Japanese, so please forgive me for using Korean as an example.

In Korean, a sentence is often shorter than an English sentence, due mainly to the fact that it is cut short, with context filling in the missing words. For instance, saying "I love you" could be as short as 사랑이 ("sarangi", simply the word "love"), or as long as the fully qualified sentence 저는 당신이 사랑이에요 (I [topic] you [object] love [verb + polite modifier]). In a text, how it is written depends on context, which is usually set by earlier sentences in the paragraph.

Anyway, having an algorithm that actually KNOWS this kind of thing would be very difficult, so you're probably much better off just using statistics. What you should do is use random samples of matched Japanese and English texts that are known to have the same meaning. The larger the sample (and the more random it is) the better... though if they are truly random, it won't make much difference how many you have past a few hundred.
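A rough sketch of that statistical approach, assuming the matched corpus is already paired up as (japanese_text, english_text) tuples; the sample size and the names are illustrative:

    import random
    import statistics

    def length_ratio(corpus, sample_size=300, seed=0):
        # corpus: list of (japanese_text, english_text) pairs with the same meaning.
        # Returns the mean and standard deviation of English words per Japanese
        # character over a random sample of document pairs.
        random.seed(seed)
        sample = random.sample(corpus, min(sample_size, len(corpus)))
        ratios = [len(en.split()) / len(ja) for ja, en in sample if ja]
        return statistics.mean(ratios), statistics.stdev(ratios)

A large standard deviation relative to the mean is a hint that a single global ratio is not enough - which is exactly the per-genre problem described next.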

Now, another thing is that this ratio changes completely with the type of text being translated. For instance, a highly technical document is quite likely to have a much higher Japanese/English length ratio than a soppy novel.

As for simply using your dictionary of word-to-word translations - that probably won't work too well (and is probably wrong). The same word does not translate to the same word every time in a different language (although this is much more likely to happen in technical discussions). Take the word "beautiful", for instance. Not only is there more than one word I could assign it to in Korean (i.e. there is a choice), but sometimes I lose that choice, as in the sentence "that food is beautiful", where I don't mean the food looks good - I mean it tastes good - and my choice of translations for that word changes. And this is a VERY common circumstance.

Another big problem is optimal translation. It's something that humans are really bad at, and something that computers are much, much worse at. Whenever I've proofread a document translated from another language into English, I can always see various ways to cut it much, much shorter.

So although, with statistics, you would be able to work out a pretty good average length ratio between translations, it will be far different from what it would be if all the translations were optimal.

Vincent McNabb
+2  A: 

Here's what Borland (now Embarcadero) thinks about English to non-English:

Length of English string (in characters)    Expected increase
1-5                                          100%
6-12                                          80%
13-20                                         60%
21-30                                         40%
31-50                                         20%
over 50                                       10%

I think you can sort of apply this (with some modification) for Japanese to non-Japanese.
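If you wanted to apply a tiered table like that mechanically, a lookup might look like the sketch below. The tiers are the Borland English-to-non-English figures above; the actual Japanese-to-English percentages would have to be refit from your own corpus.

    # Tiers as (inclusive upper bound of source length, fractional increase).
    # These are Borland's English-to-non-English numbers - refit them for
    # Japanese-to-English before relying on them.
    EXPANSION_TIERS = [
        (5, 1.00),   # 1-5 chars   -> +100%
        (12, 0.80),  # 6-12 chars  -> +80%
        (20, 0.60),  # 13-20 chars -> +60%
        (30, 0.40),  # 21-30 chars -> +40%
        (50, 0.20),  # 31-50 chars -> +20%
    ]
    OVERFLOW = 0.10  # over 50 chars -> +10%

    def expected_length(source_length):
        # Estimate translated-string length from source length using the tiers.
        for upper, increase in EXPANSION_TIERS:
            if source_length <= upper:
                return source_length * (1 + increase)
        return source_length * (1 + OVERFLOW)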

Another element you might want to consider is the tone of the language. In English, instructions are phrased as imperatives, as in "Press OK." But in Japanese, imperatives are considered rude, and you must phrase instructions in honorific language (keigo), as in "OKボタンを押してください。"

Watch out for three-character kanji combos. Many of the big words translate into three- or four-character kanji combos such as 国際化 (internationalization: 20 characters) and 高可用性 (high availability: 17 characters).

eed3si9n
I'm not nitpicking, but I thought you'd like to know that 押してください is not honorific; it's just polite.
Mike Sickler
@mikemacman, I was using the term honorific broadly to include all three modes of keigo, including sonkeigo, kenjogo, and teineigo: http://ja.wikipedia.org/wiki/敬語
eed3si9n
Depends on the software; I see a lot of すること or just plain して in things like iTunes and Safari. This, of course, combined with plenty of 〜させていただきます
Don Werve
What is an approximate literal translation of "OKボタンを押してください。"? Is it obsequious, like "The fine gentleman should consider pressing the button OK", or just wordy, "For the desired result to be obtained, it is important that the button with the label containing OK could be depressed by the user"?
Ed Griebel
@Ed Griebel, the literal translation is "Please press OK button." Like I wrote in the answer, in English you are supposed to phrase the instructions in a concise, chop-chop manner. In Japanese culture, you have to politely ask the user. This difference in tone can affect the ratio significantly.
eed3si9n
+1  A: 

I would start with a linear approximation:

    approx_english_words = a1 * no_chars_in_script1 + a2 * no_chars_in_script2 + a3 * no_chars_in_script3

with the coefficients a1, a2, a3 fitted from your data using linear least squares.

If this doesn't approximate very well, then look at the worst cases to see why they don't fit (specialized words, etc.).
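A sketch of that least-squares fit, assuming numpy is available and the corpus is a list of matched (japanese_text, english_text) pairs; the Unicode ranges used for script classification are the usual Kanji/Hiragana/Katakana blocks, and the helper names are mine:

    import numpy as np

    def char_counts(text):
        # Return (kanji, hiragana, katakana) character counts for a Japanese text.
        kanji = hiragana = katakana = 0
        for ch in text:
            cp = ord(ch)
            if 0x4E00 <= cp <= 0x9FFF:        # CJK Unified Ideographs
                kanji += 1
            elif 0x3040 <= cp <= 0x309F:      # Hiragana
                hiragana += 1
            elif 0x30A0 <= cp <= 0x30FF:      # Katakana
                katakana += 1
        return kanji, hiragana, katakana

    def fit_coefficients(corpus):
        # corpus: list of (japanese_text, english_text) pairs.
        # Fits a1, a2, a3 by linear least squares.
        X = np.array([char_counts(ja) for ja, _ in corpus], dtype=float)
        y = np.array([len(en.split()) for _, en in corpus], dtype=float)
        coeffs, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
        return coeffs  # [a1, a2, a3]

    def approx_english_words(japanese_text, coeffs):
        return float(np.dot(coeffs, char_counts(japanese_text)))

Inspecting the largest residuals from the fit is one way to find the "worst cases" mentioned above.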

Rafał Dowgird
I think this would only be possible to do for a specific translator since their speech/writing patterns may be more predictable.
Elijah
+1  A: 

In my experience as a translator and localization specialist, a good rule of thumb is 2 Japanese characters per English word.
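A trivial sketch of that rule of thumb (the function name and the 400-character example are just illustrative):

    def rough_word_estimate(japanese_text):
        # Rule of thumb: roughly 2 Japanese characters per English word.
        return len(japanese_text) / 2

    # e.g. a 400-character Japanese passage -> roughly 200 English words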

Mike Sickler
+1  A: 

As an experienced translator between Japanese and English, I can say that this is extremely difficult to quantify, but typically in my experience English text translated from Japanese runs to nearly 200% of the character count of the source text. In Japanese there are many culturally specific phrases and nouns that can't be translated literally and need to be explained in English. When translating, it is not unusual for me to take a single Japanese sentence and make a single English paragraph out of it in order for the meaning to be communicated to the reader. Off the top of my head, here is an example:

「懐かしい」

This literally means "nostalgic". However, in Japanese it can be used as a single phrase in an exclamation. Yet, in English, in order to convey a feeling of nostalgia, we require a lot more context. For instance, you may need to turn that single phrase into a sentence:

"As I walked by my old elementary school, I was flooded with memories of the past."

This is why machine translation between Japanese and English is impossible.

Elijah
+1 very true, but not impossible. You just need a big enough database and a fast enough processor to hash it all together. If humans can do it, machines can too.
Hayato
A: 

My (albeit tiny) experience seems to indicate that, no matter what the language, blocks of text take the same amount of printed space to convey equivalent information. So, for a large-ish block of text, you could assign a width count to each character in English (grab this from a common font like Times New Roman), and likewise use a common Japanese font at the same point size to calculate the number of characters that would be required.
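A small sketch of that width-based idea, assuming Pillow is installed; the font file names are placeholders that would need to point at real English and Japanese fonts on your system:

    from PIL import ImageFont

    # Font paths are placeholders - substitute fonts installed on your system.
    ENGLISH = ImageFont.truetype("times.ttf", 12)
    JAPANESE = ImageFont.truetype("NotoSansCJKjp-Regular.otf", 12)

    def equivalent_english_chars(japanese_text):
        # Estimate how many English characters would fill the same printed width
        # as the Japanese text at the same point size.
        alphabet = "abcdefghijklmnopqrstuvwxyz"
        avg_english_width = ENGLISH.getlength(alphabet) / len(alphabet)
        return JAPANESE.getlength(japanese_text) / avg_english_width

Dividing that figure by an average English word length (around five characters plus a space) would then give an approximate word count.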

Don Werve