I'd like to make MySQL full text search work with Japanese and Chinese text, as well as any other language. The problem is that these languages, and probably others, do not normally have white space between words. Search is not useful if you must type exactly the same sentence that appears in the text.

I cannot just put a space between every character because English must work too. I would like to solve this problem with PHP or MySQL.

Can I configure MySQL to recognize characters which should be their own indexing units? Is there a PHP module that can recognize these characters so I could just throw spaces around them for the index?

Update

A partial solution:

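// The "u" modifier makes PCRE treat the pattern and the subject as UTF-8,
// so the range matches whole characters rather than individual bytes.
$string_with_spaces =
  preg_replace( "/[".json_decode('"\u4e00"')."-".json_decode('"\uface"')."]/u",
  " $0 ", $string_without_spaces );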

This makes a character class out of at least some of the characters I need to treat specially. I should probably mention, it is acceptable to munge the indexed text.

Does anyone know all the ranges of characters I'd need to insert spaces around?

Also, there must be a better, portable way to represent those characters in PHP? Literal Unicode characters in source code are not ideal: I will not recognize all of the characters, and they may not render on all the machines I have to use.
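On the portability point, one option is PCRE's code point escapes: with the `u` modifier, `\x{4e00}` names a character by number, so the source file can stay plain ASCII. The following is only a sketch under the assumption that the input is valid UTF-8; the function name `space_out_cjk` is made up for illustration, and the range is simply the one from the snippet above, not a vetted list. That range takes in the CJK Unified Ideographs and the Hangul syllables but not hiragana (U+3040 to U+309F) or katakana (U+30A0 to U+30FF). PCRE also accepts script properties such as `\p{Han}`, `\p{Hiragana}` and `\p{Katakana}`, which avoid hard-coded ranges.

```php
<?php
// Hypothetical helper: surround every character in U+4E00..U+FACE
// (the range used in the question) with spaces.
function space_out_cjk(string $text): string
{
    // \x{...} is a PCRE code point escape; it needs the "u" modifier,
    // which also makes PCRE read $text as UTF-8.
    $spaced = preg_replace('/[\x{4e00}-\x{face}]/u', ' $0 ', $text);

    // preg_replace() returns null on failure (for example invalid UTF-8);
    // in that case fall back to the untouched input.
    return $spaced ?? $text;
}
```

English text passes through unchanged because none of its characters fall inside the range.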

+10  A: 

Word breaking for the languages mentioned requires a linguistic approach, for example one that uses a dictionary along with an understanding of basic stemming rules.

I've heard of relatively successful full text search applications for Chinese that simply treat every single character as a separate word and apply the same "tokenization" to the search criteria supplied by end users. The search engine then ranks higher the documents that contain the character-words in the same order as the search criteria. I'm not sure this could be extended to languages such as Japanese, as the Hiragana and Katakana character sets make the text more akin to European languages with a short alphabet.
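A rough sketch of that character-per-word idea, assuming UTF-8 input and a PCRE build with Unicode property support. Both function names are hypothetical. The important part is that the indexed text and the user's query go through the same splitter. Wrapping the query in double quotes turns it into a phrase search in MySQL's boolean mode, which is a stricter stand-in for "rank same-order matches higher", not the ranking scheme described above.

```php
<?php
// Hypothetical helper: make every Han character its own
// whitespace-separated token; other text is left as it is.
function split_han_chars(string $text): string
{
    $spaced = preg_replace('/\p{Han}/u', ' $0 ', $text);
    if ($spaced === null) {
        return $text; // e.g. invalid UTF-8: leave the input alone
    }
    // Collapse the runs of ASCII whitespace created above.
    return trim(preg_replace('/[ \t\r\n]+/', ' ', $spaced));
}

// Hypothetical helper: tokenize the user's query exactly like the indexed
// text and wrap it in double quotes so boolean mode treats it as a phrase.
function build_fulltext_query(string $userInput): string
{
    // Drop double quotes so the input cannot end the phrase early.
    $tokens = split_han_chars(str_replace('"', ' ', $userInput));
    return '"' . $tokens . '"';
}
```

The result would be bound as a parameter to something like `MATCH (body_spaced) AGAINST (? IN BOOLEAN MODE)`, where `body_spaced` is an assumed column holding the output of `split_han_chars()`. Note that MySQL ignores short words by default (`ft_min_word_len` is 4 for MyISAM), so single-character tokens are only indexed once that limit is lowered and the index rebuilt.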

EDIT:
Resources
This word breaking problem, as well as related issues, is so non-trivial that whole books are written about it. See for example CJKV Information Processing (CJKV stands for Chinese, Japanese, Korean and Vietnamese; you may also use the CJK keyword, for in many texts, Vietnamese is not discussed). See also Word Breaking in Japanese is hard for a one-pager on this topic.
Understandably, the majority of the material covering this topic is written in one of the underlying native languages, and is therefore of limited use for people without relative fluency in these languages. For that reason, and also to help you validate the search engine once you start implementing the word breaker logic, you should seek the help of a native speaker or two.

Various ideas
Your idea of identifying characters which systematically imply a word break (say quotes, parentheses, hyphen-like characters and such) is good, and that is probably one heuristic used by some of the professional grade word breakers. Yet you should seek an authoritative source for such a list, rather than assembling one from scratch based on anecdotal findings.
A related idea is to break words at Kana-to-Kanji transitions (but I'm guessing not the other way around), and possibly at Hiragana-to-Katakana or vice-versa transitions.
Unrelated to word-breaking proper, the index may [ -or may not- ;-)] benefit from the systematic conversion of every, say, hiragana character to the corresponding katakana character. Just an uneducated idea! I do not know enough about the Japanese language to know if that would help; intuitively, it would be loosely akin to the systematic conversion of accented letters and such to the corresponding unaccented letters, as practiced with several European languages.
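For what it is worth, the hiragana-to-katakana folding is mechanically simple, because the katakana block mirrors the hiragana block at an offset of 0x60 code points. Here is a minimal sketch, assuming UTF-8 text; both function names are made up for illustration, and whether the folding actually improves search quality remains an open question. If the mbstring extension is available, `mb_convert_kana($text, 'C', 'UTF-8')` performs the same conversion.

```php
<?php
// Encode a code point in the U+0800..U+FFFF range as three UTF-8 bytes.
function utf8_3byte(int $cp): string
{
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: replace hiragana U+3041..U+3096 with the katakana
// character 0x60 code points higher; everything else is left untouched.
function fold_hiragana_to_katakana(string $text): string
{
    static $map = null;
    if ($map === null) {
        $map = [];
        for ($cp = 0x3041; $cp <= 0x3096; $cp++) {
            $map[utf8_3byte($cp)] = utf8_3byte($cp + 0x60);
        }
    }
    // strtr() works on bytes, but every key starts with a UTF-8 lead byte,
    // so a key can only match at the start of a character.
    return strtr($text, $map);
}
```

The same function would have to be applied to both the indexed text and the search terms for the folding to have any effect.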

Maybe the idea I mentioned earlier, of systematically indexing individual characters (and of ranking the search results based on their proximity order-wise to the search criteria), can be slightly altered, for example by keeping consecutive kana characters together, and then some other rules... and produce an imperfect but practical enough search engine.
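To make that variation concrete, here is a hedged sketch, assuming UTF-8 input and Unicode property support in PCRE. The function name `tokenize_mixed` and the exact rules are my own guesses, not an established algorithm: each Han character becomes its own token, runs of hiragana and runs of katakana stay together (so a break falls at every script transition), and punctuation is discarded.

```php
<?php
// Hypothetical tokenizer: one token per Han character, one token per run of
// hiragana or katakana, and one token per run of anything else.
function tokenize_mixed(string $text): array
{
    // Turn punctuation and all kinds of whitespace into plain spaces first.
    $clean = preg_replace('/[\p{P}\p{Z}\s]+/u', ' ', $text);
    if ($clean === null) {
        return []; // e.g. invalid UTF-8
    }

    $pattern = '/\p{Han}'                     // a single Han character
             . '|\p{Hiragana}+'               // a run of hiragana
             . '|[\p{Katakana}\x{30FC}]+'     // katakana plus the long-vowel mark
             . '|[^\s\p{Han}\p{Hiragana}\p{Katakana}\x{30FC}]+' // Latin words, digits, ...
             . '/u';

    if (preg_match_all($pattern, $clean, $matches) === false) {
        return [];
    }
    return $matches[0];
}
```

Joining the tokens with `implode(' ', tokenize_mixed($text))` gives a string that a whitespace-based indexer can handle; the query would need the same treatment.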

Do not be disappointed if this is not the case... As stated, this is far from trivial, and pausing to read a book or two may save you time and money in the long term. Another reason to learn more of the "theory" and best practices is that at the moment you seem to be focused on word breaking, but soon the search engine may also benefit from stemming awareness; indeed these two issues are, linguistically at least, related, and may benefit from being handled in tandem.

Good luck on this vexing but worthy endeavor.

mjv
It is totally acceptable to me to split up compound words. I just need to know when to split symbols. See my soon to be made update for a partial solution.
Joe Langeway
Forgive me. I also meant to say thank you for your time. :)
Joe Langeway
@Joe: You are welcome. I happen to have an interest in linguistics and NLP but very, very, little knowledge specific to the CJK languages. Do read my edit as I added some keywords and online references which may help your quest. Good luck :-)
mjv
+1  A: 

One year later, and you probably don't need this any more but the code on the following page might have some hints for what you want(ed) to do:

http://www.geocities.co.jp/SiliconValley-PaloAlto/7043/spamfilter/japanese-tokenizer.el.txt

If you made any progress after the above posts in your own search I am sure others would be interested to know.

(Edited to say there is a better answer here:

http://stackoverflow.com/questions/3826918/how-to-classify-japanese-characters-as-either-kanji-or-kana)

B_W
It turned out that recognizing the character range in the example in the update to my question solved the problem in all the cases that have come up so far. At least, the small number of our users to whom this matters seemed satisfied.
Joe Langeway
I look forward to the day when this solution isn't adequate any more and I can solve the problem more completely and interestingly.
Joe Langeway
Thank you for your time.
Joe Langeway