views:

221

answers:

6

I want to split a sentence into a list of words.

For English and European languages this is easy, just use split()

>>> "This is a sentence.".split()
['This', 'is', 'a', 'sentence.']

But I also need to deal with sentences in languages such as Chinese that don't use whitespace as a word separator.

>>> u"这是一个句子".split()
[u'\u8fd9\u662f\u4e00\u4e2a\u53e5\u5b50']

Obviously that doesn't work.

How do I split such a sentence into a list of words?

UPDATE:

So far the answers seem to suggest that this requires natural language processing techniques and that the word boundaries in Chinese are ambiguous. I'm not sure I understand why. The word boundaries in Chinese seem very definite to me. Each Chinese word/character has a corresponding Unicode codepoint and is displayed on screen as a separate word/character.

So where does the ambiguity come from? As you can see in my Python console output, Python has no problem telling that my example sentence is made up of 6 characters:

这 - u8fd9
是 - u662f
一 - u4e00
个 - u4e2a
句 - u53e5
子 - u5b50

So obviously Python has no problem telling the word/character boundaries. I just need those words/characters in a list.

+6  A: 

You can do this but not with standard library functions. And regular expressions won't help you either.

The task you are describing is part of the field called Natural Language Processing (NLP). There has been quite a lot of work done already on splitting Chinese words at word boundaries. I'd suggest that you use one of these existing solutions rather than trying to roll your own.
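For illustration, here is what one such existing solution looks like in practice. This sketch uses jieba, a popular open-source Chinese segmenter (my example, not one discussed in this thread), and the exact segmentation it returns depends on its dictionary and version:

import jieba

# jieba.lcut() segments a sentence and returns the words as a list
words = jieba.lcut(u"这是一个句子")
print(words)  # typically ['这是', '一个', '句子'], depending on dictionary/version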

Where does the ambiguity come from?

What you have listed there is Chinese characters. These are roughly analogous to letters, or actually syllables, in English. There is no ambiguity about where the character boundaries are - that is very well defined. But you asked not for character boundaries but for word boundaries, and Chinese words can consist of more than one character.

If all you want is to find the characters then this is very simple and does not require an NLP library. Simply decode the message into a unicode string (if it is not one already), then convert the unicode string to a list using the built-in function list. This will give you a list of the characters in the string. For your specific example:

>>> list(u"这是一个句子")
[u'\u8fd9', u'\u662f', u'\u4e00', u'\u4e2a', u'\u53e5', u'\u5b50']
Mark Byers
Indeed. I might add http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html
Jim Brissom
The vast majority of Chinese characters have an individual meaning by themselves. In the example "这是一个句子" each of these characters means something: 这 = "this", 是 = "is", 一 = "one", 个 = "a", 句 = "sentence". The tricky part is that some characters combine into compounds that mean one "thing" (e.g. 句子 = "sentence"). Also, sometimes a compound has a meaning completely different from its individual characters.
NullUserException
So each of these unicode characters is a 字 for the Chinese, which is not the same thing as a "word" (词), but also not equivalent to a Western letter, or syllable.
NullUserException
+2  A: 

It's partially possible with Japanese, where you usually have different character classes at the beginning and end of a word, but there are whole scientific papers on the subject for Chinese. I have a regular expression for splitting words in Japanese, if you are interested: http://hg.hatta-wiki.org/hatta-dev/file/cd21122e2c63/hatta/search.py#l19
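As a rough illustration of the character-class idea, here is a much simplified sketch (not the linked search.py itself; the ranges are standard Unicode blocks, and gluing trailing hiragana onto a kanji run is only a heuristic):

import re

# crude 'word-ish' pattern based on Japanese character classes
WORDISH = re.compile(
    u'[\u4e00-\u9fff]+[\u3041-\u309f]*'  # kanji run + optional hiragana tail
    u'|[\u30a0-\u30ff]+'                 # katakana run
    u'|[\u3041-\u309f]+'                 # hiragana run
    u'|[a-zA-Z0-9]+'                     # latin letters and digits
)

print(WORDISH.findall(u'私はプログラマです'))
# ['私は', 'プログラマ', 'です'] - note the particle は sticks to 私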

Radomir Dopieralski
+2  A: 

Languages like Chinese have a very fluid definition of a word. E.g. one meaning of "ma" is "horse". One meaning of "shang" is "above" or "on top of". The compound "mashang" literally means "on horseback" but is used figuratively to mean "immediately". You need a very good dictionary with compounds in it, and dictionary lookup needs a longest-match approach. Compounding is rife in German (a famous example is something like "Danube steam navigation company director's wife" being expressed as one word), Turkic languages, Finnish, and Magyar - these languages have very long words, many of which won't be found in a dictionary and need breaking down to be understood.
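A minimal sketch of that longest-match idea, with a made-up toy dictionary (a real system needs a large dictionary and smarter disambiguation):

# Greedy longest-match ('maximum matching') segmentation.
# The dictionary here is a toy; real ones hold many thousands of entries.
DICTIONARY = {u"马", u"上", u"马上", u"我", u"来"}

def longest_match_segment(text, dictionary, max_word_len=4):
    words = []
    i = 0
    while i < len(text):
        # try the longest candidate first, then shrink the window
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # unknown single characters fall through as one-character words
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

>>> longest_match_segment(u"我马上来", DICTIONARY)
[u'我', u'马上', u'来']

Note how 马上 comes out as one word ("immediately") rather than as 马 + 上.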

Your problem is one of linguistics, nothing to do with Python.

John Machin
I guess we just have different usages of the term "word" when applying it to Chinese. To me "ma shang" ("马上") is simply 2 words/characters: 马 is the first, 上 is the second.
Continuation
To me, a word is a string of one or more characters that has a particular meaning. Note that some characters have no meaning of their own, and make sense only in conjunction with another character. A list of characters is not much use.
John Machin
+1  A: 

Try this: http://code.google.com/p/pymmseg-cpp/

tszming
A: 

Ok I figured it out.

What I need can be accomplished by simply using list():

>>> list(u"这是一个句子")
[u'\u8fd9', u'\u662f', u'\u4e00', u'\u4e2a', u'\u53e5', u'\u5b50']

Thanks for all your inputs.

Continuation
-1 What you think you need is not very useful. It's like trying to extract the meaning of "breakfast" from the 2 separate concepts "break" and "fast".
John Machin
i'd say leave to the OP what he thinks is useful. and breakfast is, indeed, to 'break' the 'fast', to eat again after having stopped doing so for a while. not useful?
flow
@flow: "breakfast" *was* (not "is") to "break" the "fast" ... the connection is as tenuous as "nine dragon" -> "Kowloon"
John Machin
kowloon 九龍 quite literally means nine dragons, at least to the gazillions of people who do not know better. as wikipedia quite succinctly states (in translation): "the most common traditional explanation for the name kowloon is that the eight mountain ridges in its north are dragon veins; add the emperor himself and you get nine dragons. another account says that nine simply stands for 'many'". as for breakfast, http://www.thefreedictionary.com/breakfast offers: "Middle English brekfast : breken, to break; see break + faste, a fast (from Old Norse fasta, to fast; see past- in Indo-European roots)." and, well, 'was', 'is', 'will', whatever.
flow
+2  A: 

just a word of caution: using list( '...' ) (in Py3; that's u'...' for Py2) will not, in the general sense, give you the characters of a unicode string; rather, it will most likely give you a series of 16-bit code units. this is true for all 'narrow' CPython builds, which account for the vast majority of python installations today.
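(you can check which kind of build you are running with sys.maxunicode:)

import sys
# 65535 (0xFFFF) on a 'narrow' build, 1114111 (0x10FFFF) on a 'wide' build
print(sys.maxunicode)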

when unicode was first proposed in the 1990s, it was suggested that 16 bits would be more than enough to cover all the needs of a universal text encoding, as it enabled a move from 128 codepoints (7 bits) and 256 codepoints (8 bits) to a whopping 65'536 codepoints. it soon became apparent, however, that this had been wishful thinking; today, around 100'000 codepoints are defined in unicode version 5.2, and thousands more are pending inclusion. in order for that to become possible, unicode had to move from 16 to (conceptually) 32 bits (although it doesn't make full use of the 32-bit address space).

in order to maintain compatibility with software built on the assumption that unicode was still 16 bits, so-called surrogate pairs were devised, where two 16-bit code units from specifically designated blocks are used to express codepoints beyond 65'536 - that is, beyond what unicode calls the 'basic multilingual plane', or BMP. the planes beyond the BMP are jokingly referred to as the 'astral' planes of the encoding, for their relative elusiveness and the constant headache they offer to people working in the field of text processing and encoding.
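to make the mechanics concrete, this is the arithmetic behind surrogate pairs (a small illustrative helper, not part of any library; it reproduces the '\ud85f\udc3c' pair used below):

def to_surrogate_pair(codepoint):
    # only codepoints beyond the BMP are expressed as surrogate pairs
    assert codepoint > 0xffff
    offset = codepoint - 0x10000       # 20 significant bits remain
    high = 0xd800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xdc00 + (offset & 0x3ff)    # bottom 10 bits -> low surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x27c3c)])  # ['0xd85f', '0xdc3c']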

now while narrow CPython deals with surrogate pairs quite transparently in some cases, it will still fail to do the right thing in other cases, string splitting being one of the more troublesome ones. in a narrow python build, list( 'abc大\U00027C3Cdef' ) (the astral character U+27C3C is written as an escape here, since most fonts cannot display it) will result in ['a', 'b', 'c', '大', '\ud85f', '\udc3c', 'd', 'e', 'f'], with '\ud85f', '\udc3c' being a surrogate pair. incidentally, '\ud85f\udc3c' is also what the JSON standard expects you to write in order to represent U+27C3C. either of these codepoints is useless on its own; a well-formed unicode string can only ever contain surrogates in pairs.

so what you want, in order to split a string into characters, is really:

from re import compile as _Re

# match either a complete surrogate pair (a high surrogate followed by a low
# surrogate) or, failing that, any single code unit; (?s) lets . match newlines
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split

def split_unicode_chrs( text ):
  # re.split() with a capturing group interleaves empty strings between
  # adjacent matches, so filter those out
  return [ ch for ch in _unicode_chr_splitter( text ) if ch ]

which correctly returns ['a', 'b', 'c', '大', '\ud85f\udc3c', 'd', 'e', 'f'] - on a narrow build, the surrogate pair comes back as a single list element, i.e. as one astral character (note: you can probably rewrite the regular expression so that filtering out empty strings becomes unnecessary).
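(one way to do exactly that: use findall instead of split. without a capturing group, findall returns only the matches, so no empty strings appear:)

from re import compile as _Re

# same pattern minus the capturing group; findall returns only the matches,
# so there are no empty strings to filter out
_unicode_chr_finder = _Re( '(?s)(?:[\ud800-\udbff][\udc00-\udfff])|.' ).findall

# _unicode_chr_finder( text ) then replaces split_unicode_chrs( text )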

if all you want to do is to split a text into chinese characters, you'd be pretty much done at this point. not sure what the OP's concept of a 'word' is, but to me, 这是一个句子 may be equally split into 这 | 是 | 一 | 个 | 句子 as well as 这是 | 一个 | 句子, depending on your point of view. however, anything that goes beyond the concept of (possibly composed) characters and character classes (symbols vs whitespace vs letters and such) goes well beyond what is built into unicode and python; you'll need some natural language processing to do that.

let me remark that while your example 'yes the United Nations can!'.split() does successfully demonstrate that the split method does something useful to a lot of data, it does not parse the english text into words correctly: it fails to recognize United Nations as one word, while it falsely assumes can! is a word, which it is clearly not. this method gives both false positives and false negatives. depending on your data and what you intend to accomplish, this may or may not be what you want.

flow