just a word of caution: using list( '...' ) (in Py3; that's u'...' for Py2) will not, in the general sense, give you the characters of a unicode string; rather, it will most likely result in a series of 16 bit codepoints. this is true for all 'narrow' CPython builds, which account for the vast majority of python installations today.
when unicode was first proposed in the 1990s, it was suggested that 16 bits would be more than enough to cover all the needs of a universal text encoding, as it enabled a move from 128 codepoints (7 bits) and 256 codepoints (8 bits) to a whopping 65'536 codepoints. it soon became apparent, however, that this had been wishful thinking; today, around 100'000 codepoints are defined in unicode version 5.2, and thousands more are pending inclusion. in order for that to become possible, unicode had to move from 16 to (conceptually) 32 bits (although it doesn't make full use of the 32 bit address space).
in order to maintain compatibility with software built on the assumption that unicode was still 16 bits, so-called surrogate pairs were devised, where two 16 bit codepoints from specifically designated blocks are used to express codepoints beyond 65'535, that is, beyond what unicode calls the 'basic multilingual plane', or BMP, and which are jokingly referred to as the 'astral' planes of that encoding, for their relative elusiveness and the constant headache they offer to people working in the field of text processing and encoding.
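for the curious, the mechanics are simple enough to sketch; this is roughly the standard UTF-16 arithmetic that turns a codepoint beyond the BMP into its two surrogates (to_surrogate_pair is just a name made up for illustration):

def to_surrogate_pair( codepoint ):
  # only codepoints above U+FFFF (i.e. outside the BMP) get encoded as surrogate pairs
  assert codepoint > 0xffff
  offset = codepoint - 0x10000          # at most 20 bits of information remain
  high   = 0xd800 + ( offset >> 10 )    # upper 10 bits go into the high ('lead') surrogate
  low    = 0xdc00 + ( offset & 0x3ff )  # lower 10 bits go into the low ('trail') surrogate
  return high, low

# to_surrogate_pair( 0x27c3c ) gives ( 0xd85f, 0xdc3c ), the very pair shown below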
now while narrow CPython deals with surrogate pairs quite transparently in some cases, it will still fail to do the right thing in other cases, string splitting being one of those more troublesome cases. in a narrow python build, list( 'abc大\U00027C3Cdef' ) (the second CJK character is written as the escape \U00027C3C here, since it lies outside the BMP and may not render everywhere) will result in ['a', 'b', 'c', '大', '\ud85f', '\udc3c', 'd', 'e', 'f'], with '\ud85f', '\udc3c' being a surrogate pair. incidentally, '\ud85f\udc3c' is what the JSON standard expects you to write in order to represent U+27C3C. neither of these codepoints is usable on its own; a well-formed unicode string can only ever have surrogates in pairs.
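as a quick aside on that JSON convention, the stdlib json module round-trips such a character through exactly that surrogate-pair escape; a small sketch on a modern (wide / 3.3+) python 3, where the codepoint comes back as a single character:

import json
# serializing the astral character produces the surrogate-pair escape that JSON mandates
assert json.dumps( '\U00027c3c' ) == '"\\ud85f\\udc3c"'
# and parsing that escape reassembles the single astral character
assert json.loads( '"\\ud85f\\udc3c"' ) == '\U00027c3c'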
so what you want in order to split a string into characters is really:
from re import compile as _Re
# match either a full surrogate pair or any single character; (?s) lets . match newlines as well
# (on Py2, write the pattern as a unicode literal: u'(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)')
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
  return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
which correctly returns ['a', 'b', 'c', '大', '\ud85f\udc3c', 'd', 'e', 'f'], the surrogate pair now kept together as a single element (i.e. the single character U+27C3C)
(note: you can probably rewrite the regular expression so that filtering out empty strings becomes unnecessary).
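as a side note, you can detect at runtime which kind of build you are on and only fall back to the regex where it is actually needed; a small sketch (split_chrs is just an illustrative name, reusing split_unicode_chrs from above):

import sys
def split_chrs( text ):
  # narrow builds report sys.maxunicode == 0xffff; wide builds (and python 3.3+, where the
  # narrow/wide distinction went away) report 0x10ffff, and there a plain list() is enough
  if sys.maxunicode > 0xffff:
    return list( text )
  return split_unicode_chrs( text )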
if all you want to do is to split a text into chinese characters, you'd be pretty much done at this point. not sure what the OP's concept of a 'word' is, but to me, 这是一个句子 may be equally split into 这 | 是 | 一 | 个 | 句子 as well as 这是 | 一个 | 句子, depending on your point of view. however, anything that goes beyond the concept of (possibly composed) characters and character classes (symbols vs whitespace vs letters and such) goes well beyond what is built into unicode and python; you'll need some natural language processing to do that.

let me remark that while your example 'yes the United Nations can!'.split() does successfully demonstrate that the split method does something useful to a lot of data, it does not parse the english text into words correctly: it fails to recognize United Nations as one word, while it falsely assumes can! is a word, which it is clearly not. this method gives both false positives and false negatives. depending on your data and what you intend to accomplish, this may or may not be what you want.
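to make that failure mode concrete (the \w+ pattern below is just a naive alternative, not a real tokenizer):

import re
print( 'yes the United Nations can!'.split() )
# -> ['yes', 'the', 'United', 'Nations', 'can!']  ('can!' keeps its punctuation; 'United Nations' comes out as two items)
print( re.findall( r'\w+', 'yes the United Nations can!' ) )
# -> ['yes', 'the', 'United', 'Nations', 'can']   (drops the '!', but still fails to keep the multi-word name together)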