Hi, guys!
I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, a sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走).
At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will:
1) try to find the first two characters of the sentence in the database (主楼),
2) if 主楼 is actually a word and it's in the database, the script will try to find the first three characters (主楼怎). 主楼怎 is not a word, so it's not in the database => my application now knows that 主楼 is a separate word.
3) try to do the same with the rest of the characters.
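To make the steps above concrete, here is a rough sketch of what I have in mind. It is only an illustration: the function name segment is made up, the dictionary is a small in-memory set standing in for the database, and the fallback to a single character when no two-character word matches is my own assumption (the steps above don't cover that case).

```python
def segment(sentence, dictionary):
    """Greedy split: grow a candidate while it is still a dictionary word."""
    words = []
    i = 0
    while i < len(sentence):
        # Fallback (assumption): a character with no dictionary match stands alone.
        end = i + 1
        # Start with two characters and keep extending by one
        # for as long as the candidate is found in the dictionary.
        j = i + 2
        while j <= len(sentence) and sentence[i:j] in dictionary:
            end = j
            j += 1
        words.append(sentence[i:end])
        i = end
    return words


# Hypothetical three-entry dictionary; the real one would come from the database.
dictionary = {"主楼", "怎么", "走"}
print(segment("主楼怎么走", dictionary))  # ['主楼', '怎么', '走']
```

Each membership test in the inner loop corresponds to one database lookup in the approach described above, which is where the number of queries comes from.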
I don't really like this approach, because to analyze even a small text it would query the database too many times.
Are there any other solutions to this?
Any suggestions are greatly appreciated.
Thank you!