Since I am studying Japanese, I have an idea for a few web apps I could write to help me, and maybe others, learn the language better.

My problem is that the site will be mostly in English, so it needs to fluently mix in Japanese characters, usually hiragana and katakana, but later kanji as well. I am getting closer to accomplishing this; I have figured out that the pages and source files need to be Unicode, with a UTF-8 content type.

However, my problem comes in the actual coding. What I need is to manipulate strings of kana text. One example:

I need to take the verb けす and convert it to its te-form, けして. I would prefer to do this in JavaScript, as it will help down the road with more manipulation, but if I have to I will just do DB calls and hold everything in a database.

My question is not only how to do this in JavaScript, but also what tips and strategies there are for doing these kinds of things in other languages. I am hoping to get more into writing language-learning apps, but I am lost when it comes to this.

Any advice would be great.
Thanks.

A: 

If I recall correctly (and I slacked off a lot the year I took Japanese, so I could be wrong), the replacements you want to make are determined by the last symbol or two of the word. Taking your first example, any verb ending in す will always take して when conjugated this way, and similarly む becomes んで. Could you establish a mapping of last character(s) to conjugated form? You might have to account for exceptions, such as anything which conjugates to 〜って.
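In JavaScript, that mapping might look something like the sketch below. This is minimal and assumes Type I (godan) verbs only; the table is abbreviated, and as noted it cannot handle exceptions on its own (Type II verbs like 食べる take て directly, and 行く irregularly becomes 行って).

// Last kana of a Type I verb -> te-form ending (abbreviated table).
const teEndings = {
  "す": "して", "く": "いて", "ぐ": "いで",
  "む": "んで", "ぶ": "んで", "ぬ": "んで",
  "う": "って", "つ": "って", "る": "って"
};

function teForm(verb) {
  const last = verb.slice(-1);
  return verb.slice(0, -1) + teEndings[last];
}

teForm("けす"); // "けして"
teForm("のむ"); // "のんで"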

As for portability between languages, you'll have to implement the logic differently based on how each language works. This solution would be fairly straightforward to implement for Spanish as well, since conjugation depends on whether the verb ends in -ar, -er, or -ir (with some verbs requiring exceptions in your logic). Unfortunately, that's the limit of my multilingual skills, so I don't know how well it would do beyond those two.

Jimmy
Actually, I have thought about doing the mapping and can see the benefit of it, but I also see the benefit of more on-the-fly transformation. I have been unsure of what approach to take, and even how to deal with Japanese altogether, as I code. The big thing is later on, when I get to short forms and tai forms, where I see on-the-fly transformation helping out.
percent20
+1  A: 

Your question is totally unclear to me.

However, I have some experience working with the Japanese language, so I'll give my two cents.

Since Japanese text does not feature word separation (e.g. a space character), the most important tool we had to acquire was a dictionary-based word recognizer.

Once you have the text split into words, it's easier to manipulate with "normal" tools.
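To illustrate the idea, here is a toy JavaScript sketch of dictionary-based segmentation using greedy longest-match. The dictionary here is a stand-in; real segmenters use large lexicons and statistical models.

// Greedy longest-match segmentation against a toy dictionary.
const dictionary = new Set(["日本語", "は", "とても", "難しい", "です"]);

function segment(text) {
  const words = [];
  let i = 0;
  while (i < text.length) {
    let match = null;
    // Try the longest candidate first, shrinking until the dictionary hits.
    for (let len = text.length - i; len > 0; len--) {
      const candidate = text.slice(i, i + len);
      if (dictionary.has(candidate)) { match = candidate; break; }
    }
    if (!match) match = text[i]; // unknown character: emit it alone
    words.push(match);
    i += match.length;
  }
  return words;
}

segment("日本語はとても難しいです");
// ["日本語", "は", "とても", "難しい", "です"]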

There were only two tools that did the above, and as a by-product they also worked as taggers (i.e. identifying nouns, verbs, etc.).

Edit: always use Unicode when working with languages.

Berry Tsakala
Sorry, my question is kind of two things in one. I was nervous about starting two different topics, so I combined "What are some tips for working with the Japanese language?" and "How can I accomplish xyz?". Are there any more tips you can offer from your experience? Anything would be great. I had not thought about separating out words; I hadn't gotten that far, as mostly I am after how to manipulate individual words. However, any tips on programming with the Japanese language are helpful and appreciated. To be honest, I was trying to avoid mapping files and Unicode, but it looks like I need to use either or both.
percent20
+16  A: 

Hi, I live in Japan and my job involves building and maintaining several Japanese/English bilingual websites which focus on natural language processing.

I can recommend a couple of practices:

* Stick to Unicode and UTF-8 everywhere.

* Stay away from the native Japanese encodings (EUC-JP, Shift-JIS, ISO-2022-JP), but be aware that you'll probably encounter them at some point if you continue.

* Get familiar with a segmenter for doing complicated stuff like POS analysis, word segmentation, etc. The standard tools used by most people who do NLP (natural language processing) work on Japanese are, in order of popularity/power:

MeCab (http://mecab.sourceforge.net/). MeCab is awesome: it allows you to take text like

「日本語は、とても難しいです。」

and get all sorts of great info back

kettle:~$ echo 日本語は、難しいです | mecab 
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
は   助詞,係助詞,*,*,*,*,は,ハ,ワ
、   記号,読点,*,*,*,*,、,、,、
難しい 形容詞,自立,*,*,形容詞・イ段,基本形,難しい,ムズカシイ,ムズカシイ
です  助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

which is basically a detailed run-down of the parts of speech, readings, pronunciations, etc. It will also do you the favor of analyzing verb tenses:

kettle:~$ echo メキシコ料理が食べたい | mecab 
メキシコ    名詞,固有名詞,地域,国,*,*,メキシコ,メキシコ,メキシコ
料理  名詞,サ変接続,*,*,*,*,料理,リョウリ,リョーリ
が   助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ  動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい  助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS

However, the documentation is all in Japanese, it's a bit complicated to set up, and it takes some work to figure out how to format the output the way you want it. There are packages available for Ubuntu/Debian, and bindings in a bunch of languages including Perl, Python, and Ruby... apt repos for Ubuntu:

deb http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
deb-src http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all

Packages to install:

$ apt-get install mecab-ipadic-utf8 mecab python-mecab

should do the trick, I think.
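Since the question mentions JavaScript: I'm not aware of official JavaScript bindings, but one simple option is to shell out to the mecab binary, for example from Node.js. A sketch, assuming mecab and a UTF-8 dictionary with the IPA field layout are installed and on the PATH:

// Run mecab on a string and parse its tab/comma-separated output.
const { execFileSync } = require("child_process");

function analyze(text) {
  const out = execFileSync("mecab", [], { input: text, encoding: "utf8" });
  return out.split("\n")
    .filter(line => line && line !== "EOS")
    .map(line => {
      const [surface, features] = line.split("\t");
      const f = features.split(",");
      // With the IPA dictionary, field 6 is the base form, field 7 the reading.
      return { surface: surface, pos: f[0], base: f[6], reading: f[7] };
    });
}

analyze("メキシコ料理が食べたい")
  .forEach(t => console.log(t.surface, t.pos, t.base));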

The other alternatives to MeCab are ChaSen, which was written years ago by the author of MeCab (who incidentally works at Google now), and Kakasi, which is much less powerful.

I would definitely try to avoid rolling your own conjugation routines. The problem is that it will require tons and tons of work which others have already done, and covering all the edge cases with rules is, at the end of the day, impossible.

MeCab is statistically driven and trained on loads of data. It employs a sophisticated machine learning technique called conditional random fields (CRFs), and the results are really quite good.

Have fun with the Japanese. I'm not sure how good your Japanese is, but if you need help with the docs for MeCab or whatever, feel free to ask about that as well. Kanji can be quite intimidating at the beginning.

Cheers

edited to add some more info.

blackkettle
I wish I could mark this as an answer too. :( Thanks for the great information. I was only going to do my own conjugation routines as a programming exercise and to better learn the core of the Japanese language. If I get further into Japanese I will definitely take a look at a segmenter. Thanks.
percent20
Awesome post, thanks for this. MeCab rocks
Wahnfrieden
Stumbled on MeCab while playing around with C#. Just wanted to add that it's awesome. There's also a MeCab web service @ http://mimitako.net/api/mecapi.cgi . Oh, and "unofficial" C# bindings @ http://en.sourceforge.jp/projects/mecabdotnet/ . Cheers!
Maiku Mori
+1  A: 

"My question is not only how to do it in JavaScript, but what are some tips and strategies for doing these kinds of things in other languages, too."

What you want to do is pretty basic string manipulation - apart from the missing word separators, as Berry notes, though that's not a technical problem.

Basically, for a modern Unicode-aware programming language (which JavaScript has been since version 1.3, I believe) there is no real difference between a Japanese kana or kanji and a Latin letter: they're all just characters. And a string is just, well, a string of characters.
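For example, ordinary string operations treat kana like any other characters. One caveat worth knowing: JavaScript strings are sequences of UTF-16 code units, so the rare kanji outside the Basic Multilingual Plane count as two.

const verb = "けす";
verb.length;                // 2, just like "ab"
verb.slice(0, -1) + "して"; // "けして", the te-form from the question
"𠮟る".length;              // 3, not 2: 𠮟 lies outside the BMP
[..."𠮟る"].length;         // 2, spreading iterates by code point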

Where it gets difficult is when you have to convert between strings and bytes, because then you need to pay attention to which encoding you are using. Unfortunately, many programmers, especially native English speakers, tend to gloss over this problem because ASCII is the de facto standard encoding for Latin letters, and other encodings usually try to be compatible with it. If Latin letters are all you need, you can get along being blissfully ignorant about character encodings, believe that bytes and characters are basically the same thing, and write programs that mutilate anything that's not ASCII.

So the "secret" of Unicode-aware programming is this: learn to recognize when and where strings/characters are converted to and from bytes, and make sure that in all those places the correct encoding is used, i.e. the same that will be used for the reverse conversion and one that can encode all the character's you're using. UTF-8 is slowly becoming the de-facto standard and should normally be used wherever you have a choice.

Typical examples (non-exhaustive):

  • When writing source code with non-ASCII string literals (configure encoding in the editor/IDE)
  • When compiling or interpreting such source code (compiler/interpreter needs to know the encoding)
  • When reading/writing strings to a file (encoding must be specified somewhere in the API, or in the file's metadata)
  • When writing strings to a database (encoding must be specified in the configuration of the DB or the table)
  • When delivering HTML pages via a webserver (encoding must be specified in the HTTP headers or the page's meta tags; forms can be even more tricky)
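To make the string/byte boundary concrete, here is a small sketch using the standard TextEncoder/TextDecoder APIs (available in modern browsers and in Node.js). The last line shows the kind of mutilation described above.

const bytes = new TextEncoder().encode("けす"); // TextEncoder always produces UTF-8
bytes.length;                                   // 6: each kana is 3 bytes in UTF-8
new TextDecoder("utf-8").decode(bytes);         // "けす", round-trips cleanly
new TextDecoder("windows-1252").decode(bytes);  // "ã‘ã™" plus control chars: mojibake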
Michael Borgwardt
Actually, after reading this and talking to a friend, I tried to do basic string manipulation again based on the "everything is a string" idea, and it worked. I have no idea what I was doing that killed the first attempt at it, but I am glad it was that easy, and I feel dumb for it not working the first time. Thanks for the response.
percent20
+2  A: 

What you need to do is look at the rules of grammar. Have an array of rules for each conjugation. Let's take the 〜て form, for example. A sketch in JavaScript:

function teForm(verb) {
  switch (verb.slice(-1)) {
    case "る": return verb.slice(0, -1) + "て";   // Type II: 食べる -> 食べて
    case "す": return verb.slice(0, -1) + "して"; // けす -> けして
    // ...more cases: む/ぶ/ぬ -> んで, く -> いて, etc.
  }
}

Basically, break it down into Type I, II, and III verbs.
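Hypothetical usage of the sketch above. It also shows the limit of a pure last-character rule: る is ambiguous between Type I and Type II verbs, which is where the class-based approach in the next answer comes in.

teForm("けす");   // "けして" (correct)
teForm("たべる"); // "たべて" (correct: 食べる is Type II)
teForm("かえる"); // "かえて" (wrong for Type I 帰る, whose te-form is 帰って)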

A: 

Since most verbs in Japanese follow one of a small set of predictable patterns, the easiest and most extensible way to generate all the forms of a given verb is to have the verb know what conjugation it should follow, then write functions to generate each form depending on the conjugation.

Pseudocode:

generateDictionaryForm(verb)
  case Ru-Verb: verb.stem + る
  case Su-Verb: verb.stem + す
  case Ku-Verb: verb.stem + く
  ...etc.

generatePoliteForm(verb)
  case Ru-Verb: verb.stem + ります
  case Su-Verb: verb.stem + します
  case Ku-Verb: verb.stem + きます
  ...etc.

Irregular verbs would of course be special-cased.

Some variant of this would work for any other fairly regular language (i.e. not English).
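A minimal JavaScript rendering of this idea; the class names, stem representation, and ending tables are illustrative, not a complete conjugator.

// Each verb carries its stem and conjugation class; every form is
// then a lookup in a per-form table of class endings.
const endings = {
  dictionary: { ru: "る", su: "す", ku: "く" },
  polite:     { ru: "ります", su: "します", ku: "きます" }
};

function conjugate(verb, form) {
  return verb.stem + endings[form][verb.type];
}

const hanasu = { stem: "はな", type: "su" }; // 話す
conjugate(hanasu, "dictionary"); // "はなす"
conjugate(hanasu, "polite");     // "はなします"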

Amanda S