views: 582

answers: 9

Hi, guys!

I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走).

At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will:

1) try to find the first two characters of the sentence in the database (主楼),

2) if 主楼 is actually a word and it's in the database, the script will try to find the first three characters (主楼怎). 主楼怎 is not a word and isn't in the database, so my application now knows that 主楼 is a separate word,

3) do the same with the rest of the characters, as sketched below.
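
Roughly, in Python (just a sketch; word_exists is a stand-in for one database query per call):

def split_sentence(text, word_exists):
    # Greedy version of steps 1-3: grow the candidate word one
    # character at a time while the dictionary still contains it.
    words, i = [], 0
    while i < len(text):
        j = i + 1  # fall back to a single character if nothing matches
        k = i + 2
        while k <= len(text) and word_exists(text[i:k]):
            j = k      # text[i:j] is the longest known word so far
            k += 1     # each extension costs another database query
        words.append(text[i:j])
        i = j
    return words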

I don't really like this approach, because analyzing even a small text would mean querying the database far too many times.

Are there any other solutions to this?

Any suggestions are greatly appreciated.

Thank you!

A: 

Well, if you have a database with all the words and there is no other way to get at those words, I think you are forced to re-query the database.

Younes
A: 

You can build a very, very long regular expression.

Edit: I meant building it automatically from the DB with a script, not writing it by hand.
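
Something like this, generated from the word list (a sketch; the three words here just stand in for whatever you pull from the DB):

import re

words = ["主楼", "怎么", "走"]  # stand-in for SELECTing the word column

# Longest words first, so the alternation prefers the longest match.
pattern = re.compile("|".join(map(re.escape, sorted(words, key=len, reverse=True))))

print(pattern.findall("主楼怎么走"))  # ['主楼', '怎么', '走']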

Y. Shoham
What do you think a regex that matches a complete dictionary would look like?
Younes
@Younes : very very long? ... and ugly
Bedwyr Humphreys
Using a regex to solve this problem is nonsense.
Eineki
@Eineki: After compilation, most regex implementations will build a kind of trie automaton, as Vincent Ramdhanie suggested.
Y. Shoham
@Bedwyr: That's what I meant ;)
Younes
+2  A: 

You might want to consider using a trie data structure. You first construct the trie from the dictionary; then searching for valid words will be much faster. The advantage is that determining whether you are at the end of a word, or need to continue looking for longer words, is very fast.
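
A minimal sketch in Python (nested dicts standing in for a real trie library; assumes the dictionary fits in memory):

class Trie:
    END = object()  # sentinel key marking "a word ends here"

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[Trie.END] = True

    def longest_match(self, text, start=0):
        # Walk down the trie, remembering the last position where a
        # complete word ended; one pass decides "stop here or keep going".
        node, best, i = self.root, 0, start
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if Trie.END in node:
                best = i - start
        return best

# Trie(["主楼", "怎么", "走"]).longest_match("主楼怎么走")  -> 2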

Vincent Ramdhanie
It can also be implemented as a regex (you know: automata, finite state machines, regular languages, etc.). After compilation, most regex implementations will build a kind of trie.
Y. Shoham
@Y. Shoham: Some say Regex makes you toast in the morning.
Wim Hollebrandse
+1  A: 

You have the input text: a sentence, a paragraph, whatever. So yes, your processing of it will need to query your DB for each check.

With decent indexing on the word column though, you shouldn't have too many problems.

Having said that, how big is this dictionary? After all, you only need the words, not their definitions, to check whether something is a valid word. So if at all possible (depending on the size), having a huge in-memory map/hashtable/dictionary with just keys (the actual words) may be an option, and it would be quick as lightning.

At 15 million words, say an average of 7 characters at 2 bytes each, that works out at around the 200 MB mark. Not too crazy.

Edit: at 'only' 1 million words, you're looking at just over 13 MB, say 15 with some overhead. That's a no-brainer, I would say.
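
Something along these lines (a sketch; words.txt is a hypothetical one-word-per-line dump of the word column):

# Load just the words into memory once at startup.
with open("words.txt", encoding="utf-8") as f:
    words = {line.strip() for line in f}

def is_word(s):
    return s in words  # O(1) membership test, no DB round-trip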

Wim Hollebrandse
At the moment there are only 300,000 words; at maximum there will be 1 million. So I guess it is an option =)
Peterim
In that case, I would *definitely* throw all the words into a dictionary object in memory.
Wim Hollebrandse
A: 

To improve the performance of this, can't you do all these checks before you insert the sentence into the database, and add spaces yourself?

Rob Fonseca-Ensor
The issue seems to be that for the checks, you'd need to query the dictionary...just so you know where to break up the sentence.
Wim Hollebrandse
Unfortunately, this has to be done automatically by the application. A user inputs a sentence and gets back the separate words with pinyin.
Peterim
A: 

(using ABCDE to represent Chinese characters for simplicity)

Let's say you've got the 'sentence' ABCDE as input, and your dictionary contains these words that start with A: AB, ABC, AC, AE, and ABB. And presume that the word CDE exists, but neither DE nor E does.

When parsing the input sentence, going left to right, the script pulls the first character A. Instead of querying the database to see if A is a word, query the database to pull all words that start with A.

Loop through those results, grabbing the next few characters from the input string to get a proper comparison:

AB  ?= AB : True
ABC ?= ABC: True
AC  ?= AB : False
AE  ?= AB : False
ABB ?= ABC: False

At this point the program forks down the two 'true' branches it found. On the first, it presumes AB is the first word, and tries to find C-starting words. CDE is found, so that branch is possible. Down the other branch, ABC is the first word, but DE is not possible, so that branch is invalid, meaning the first must be the true interpretation.

I think this method minimizes the number of calls to the database (though it may return larger result sets, since you're fetching all the words starting with the same character). If your database is indexed for this sort of searching, I think this would work better than going character by character. Looking at this whole process now, and at the other answers, I realize this is effectively a trie structure (with the character searched for as the root of a tree), as another poster had suggested. Well, here's a sketch of that idea!
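
A rough sketch of the forking search in Python (words_starting_with is a hypothetical helper wrapping one prefix query per position):

def segmentations(text, words_starting_with, start=0):
    # Yield every way to split text[start:] into dictionary words.
    # Dead branches (like ABC followed by DE above) simply yield nothing.
    if start == len(text):
        yield []
        return
    for word in words_starting_with(text[start]):  # one DB round-trip
        if text.startswith(word, start):
            for rest in segmentations(text, words_starting_with, start + len(word)):
                yield [word] + rest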

MidnightLightning
A: 

This is a fairly standard task in computational linguistics. It goes by the name "tokenization" or "word segmentation." Try searching for "chinese word segmentation" or "chinese tokenization" and you'll find several tools that have been made to do this task, as well as papers about research systems to do it.

To do this well, you will typically need to use a statistical model built by running a machine learning system on a fairly large training corpus. Several of the systems you can find on the web come with pre-trained models.
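
For instance, jieba is one open-source segmenter that ships with a pre-trained model (shown here only to illustrate the kind of API these tools expose):

import jieba  # open-source Chinese segmenter with a bundled model

# cut() segments using a built-in dictionary plus a statistical model
# for character sequences the dictionary doesn't cover.
print(" ".join(jieba.cut("主楼怎么走")))  # e.g. 主楼 怎么 走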

Edward Loper
+2  A: 

Thanks to everyone for your help!

After a little research, and with all your suggestions in mind, I've found some working tools; that's why I'm answering my own question.

  1. A PHP class (http://www.phpclasses.org/browse/package/2431.html)

  2. A Drupal module, basically another PHP solution with 4 different segmentation algorithms (pretty easy to understand how it works) (http://drupal.org/project/csplitter)

  3. A PHP extension for Chinese word segmentation (http://code.google.com/p/phpcws/)

  4. There are some other solutions available if you try searching baidu.com for "中文分词" ("Chinese word segmentation")

Sincerely,

Equ

Peterim
A: 

I do realize that the Chinese word segmentation problem is a very complex one, but in some cases this trivial algorithm may be sufficient: search for the longest word w starting at the i-th character, then start again at the (i+length(w))-th character.

Here's a Python implementation:

#!/usr/bin/env python3
# encoding: utf-8

import re
import unicodedata
import codecs

class ChineseDict:

    def __init__(self, lines, rex):
        # Build the word set from a CC-CEDICT file, skipping comments and
        # blank lines; `rex` extracts the simplified headword of each entry.
        self.words = set(rex.match(line).group(1)
                         for line in lines
                         if line.strip() and not line.startswith("#"))
        self.maxWordLength = max(map(len, self.words))

    def segmentation(self, text):
        result = []
        previousIsSticky = False
        i = 0
        while i < len(text):
            # Greedy longest match: try the longest candidate first and
            # fall back to a single character if nothing is in the dictionary.
            for j in range(i + self.maxWordLength, i, -1):
                s = text[i:j]
                if s in self.words:
                    break
            # A lone non-letter character (punctuation, digits, ...) is
            # "sticky": it gets glued onto the previous segment.
            sticky = len(s) == 1 and unicodedata.category(s) != "Lo"
            if previousIsSticky or (result and sticky):
                result[-1] += s
            else:
                result.append(s)
            previousIsSticky = sticky
            i = j
        return " | ".join(result)

    def genWords(self, text):
        # Same greedy scan, but yield only the dictionary words and
        # silently step over anything that is not in the dictionary.
        i = 0
        while i < len(text):
            for j in range(i + self.maxWordLength, i, -1):
                s = text[i:j]
                if s in self.words:
                    yield s
                    break
            i = j


if __name__ == "__main__":
    cedict = ChineseDict(codecs.open("cedict_ts.u8", 'r', 'utf-8'),
                         re.compile(r"(?u)^.+? (.+?) .+"))
    text = u"""33. 你可以叫我夏尔
    戴高乐将军和夫人在科隆贝双教堂村过周末。星期日早晨,伊冯娜无意中走进浴室,正巧将军在洗盆浴。她感到非常意外,不禁大叫一声:“我的上帝!”
    戴高乐于是转过身,看见妻子因惊魂未定而站立在门口。他继续用香皂擦身,不紧不慢地说:“伊冯娜,你知道,如果是我们之间的隐私,你可以叫我夏尔,用不着叫我上帝……”
    """
    print(cedict.segmentation(text))
    print(" | ".join(cedict.genWords(text)))

The last part uses a copy of the CC-CEDICT dictionary to segment a (simplified) Chinese text in two flavours (respectively with and without non-word characters):

33. 你 | 可以 | 叫 | 我 | 夏 | 尔
    戴高乐 | 将军 | 和 | 夫人 | 在 | 科隆 | 贝 | 双 | 教堂 | 村 | 过 | 周末。星期日 | 早晨,伊 | 冯 | 娜 | 无意中 | 走进 | 浴室,正巧 | 将军 | 在 | 洗 | 盆浴。她 | 感到 | 非常 | 意外,不禁 | 大 | 叫 | 一声:“我的 | 上帝!”
    戴高乐 | 于是 | 转 | 过 | 身,看见 | 妻子 | 因 | 惊魂 | 未定 | 而 | 站立 | 在 | 门口。他 | 继续 | 用 | 香皂 | 擦 | 身,不 | 紧 | 不 | 慢 | 地 | 说:“伊 | 冯 | 娜,你 | 知道,如果 | 是 | 我们 | 之间 | 的 | 隐私,你 | 可以 | 叫 | 我 | 夏 | 尔,用不着 | 叫 | 我 | 上帝……”

你 | 可以 | 叫 | 我 | 夏 | 尔 | 戴高乐 | 将军 | 和 | 夫人 | 在 | 科隆 | 贝 | 双 | 教堂 | 村 | 过 | 周末 | 星期日 | 早晨 | 伊 | 冯 | 娜 | 无意中 | 走进 | 浴室 | 正巧 | 将军 | 在 | 洗 | 盆浴 | 她 | 感到 | 非常 | 意外 | 不禁 | 大 | 叫 | 一声 | 我的 | 上帝 | 戴高乐 | 于是 | 转 | 过 | 身 | 看见 | 妻子 | 因 | 惊魂 | 未定 | 而 | 站立 | 在 | 门口 | 他 | 继续 | 用 | 香皂 | 擦 | 身 | 不 | 紧 | 不 | 慢 | 地 | 说 | 伊 | 冯 | 娜 | 你 | 知道 | 如果 | 是 | 我们 | 之间 | 的 | 隐私 | 你 | 可以 | 叫 | 我 | 夏 | 尔 | 用不着 | 叫 | 我 | 上帝 
Aristide