I have multilingual strings consisting of both languages that use whitespace as a word separator (English, French, etc.) and languages that don't (Chinese, Japanese, Korean).

Given such a string, I want to separate the English/French/etc. part into words using whitespace as the separator, and to separate the Chinese/Japanese/Korean part into individual characters.

And I want to put all of those separated components into a list.

Some examples would probably make this clear:

Case 1: English-only string. This case is easy:

>>> "I love Python".split()
['I', 'love', 'Python']

Case 2: Chinese-only string:

>>> list(u"我爱蟒蛇")
[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']

In this case I can turn the string into a list of Chinese characters. But within the list I'm getting escaped Unicode representations:

[u'\u6211', u'\u7231', u'\u87d2', u'\u86c7']

How do I get it to display the actual characters instead of the Unicode escapes? Something like:

['我', '爱', '蟒', '蛇']

??

Case 3: A mix of English & Chinese:

I want to take an input string such as

"我爱Python"

and turn it into a list like this:

['我', '爱', 'Python']

Is it possible to do something like that?

+2  A: 

Formatting a list shows the repr of its components. If you want to view the strings naturally rather than escaped, you'll need to format it yourself. (repr should not be escaping these characters; repr(u'我') should return "u'我'", not "u'\\u6211'". Apparently this does happen in Python 3; only 2.x is stuck with the English-centric escaping for Unicode strings.)
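
For example, a minimal 2.x sketch of formatting it yourself (assuming your terminal's encoding can display the characters):

>>> chars = list(u"我爱蟒蛇")
>>> print u"[%s]" % u", ".join(chars)
[我, 爱, 蟒, 蛇]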

A basic algorithm you can use is to assign a character class to each character, then group letters by class. Starter code is below.

I didn't use a doctest for this because I hit some odd encoding issues that I don't want to look into (out of scope). You'll need to implement a correct grouping function.

Note that if you're using this for word wrapping, there are other per-language considerations. For example, you don't want to break on non-breaking spaces; you do want to break on hyphens; for Japanese you don't want to split apart きゅ; and so on.

# -*- coding: utf-8 -*-
import itertools, unicodedata

def group_words(s):
    # Mutable counter used by key(); wrapped in a list to work around
    # 2.x's lack of the nonlocal keyword.
    sequence = [0x10000000]

    def key(part):
        if part.isspace():
            return 0

        # This categorization is incorrect, but serves this example; finding
        # a more accurate categorization of characters is up to the user.
        if unicodedata.category(part) == "Lo":
            # Never group Asian characters: return a unique value for each
            # one, so groupby() puts every such character in its own group.
            sequence[0] += 1
            return sequence[0]

        # Everything else (letters, digits, punctuation) groups into words.
        return 2

    result = []
    for group_key, group in itertools.groupby(s, key):
        # Discard groups of whitespace.
        if group_key == 0:
            continue
        result.append(u"".join(group))

    return result

if __name__ == "__main__":
    print group_words(u"Testing English text")
    print group_words(u"我爱蟒蛇")
    print group_words(u"Testing English text我爱蟒蛇")
Glenn Maynard
+1  A: 

I thought I'd show the regex approach, too. It doesn't feel right to me, but that's mostly because all of the language-specific i18n oddities I've seen make me worry that a regular expression might not be flexible enough for all of them, though you may well not need any of that. (In other words: possible overdesign.)

# -*- coding: utf-8 -*-
import re
def group_words(s):
    regex = []

    # Match a whole word:
    regex.append(ur'\w+')

    # Match a single CJK character:
    regex.append(ur'[\u4e00-\ufaff]')

    # Match one of anything else, except for spaces:
    regex.append(ur'[^\s]')

    # Join the alternatives and compile (kept distinct from the list's name).
    pattern = u"|".join(regex)
    r = re.compile(pattern)

    return r.findall(s)

if __name__ == "__main__":
    print group_words(u"Testing English text")
    print group_words(u"我爱蟒蛇")
    print group_words(u"Testing English text我爱蟒蛇")

In practice, you'd probably want to compile the regex only once, not on each call; a sketch of that follows. Again, filling in the particulars of character grouping is up to you.
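
For instance, the same pattern hoisted to module level (the name _WORD_RE is mine, not from the answer):

# -*- coding: utf-8 -*-
import re

# Compile once at import time instead of on every call.
_WORD_RE = re.compile(
    ur'\w+'               # a whole word
    ur'|[\u4e00-\ufaff]'  # or a single CJK character
    ur'|[^\s]')           # or one of anything else, except spaces

def group_words(s):
    return _WORD_RE.findall(s)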

Glenn Maynard
@Glenn Maynard. Thank you very much. This is exactly what I need. Could you give me pointers on where to look up the unicode "range" for various languages?
Continuation
Not really. Characters don't group nicely by language; you can probably pick out the major ranges straightforwardly enough.
Glenn Maynard
-1 @Glenn Maynard: In the "C" locale, this fails on non-ASCII non-CJK alphabetics, e.g. as found in French [OP requirement], German, Russian: `u"München"` -> `[u'M', u'\xfc', u'nchen']`. This can be fixed by using the `re.UNICODE` flag, but unfortunately that makes `\w` match most CJK chars (category `Lo`).
John Machin
@John Machin: I explicitly said that defining the exact character groupings is up to the user, as it's beyond the scope of this answer, which simply shows the method. In the future, please read answers before downvoting them.
Glenn Maynard
@Continuation: it's rangeS (plural) ... for example, for Japanese you need CJK range(s) as per Glenn's answer *PLUS* Hiragana and Katakana (U+3040 to U+30FF). The discussion of each of the "blocks" in the Unicode standard and the associated data file (`http://www.unicode.org/Public/UNIDATA/Blocks.txt`) may help. BTW, do you regard Chinese Traditional and Chinese Simplified as different "languages"?
John Machin
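
As a hedged illustration only (boundaries should be checked against Blocks.txt), adding those kana ranges to the answer's regex could be one extra alternative, placed before the catch-all:

# Sketch: match a run of Hiragana/Katakana as a single token, per the comment above.
regex.append(ur'[\u3040-\u30ff]+')
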
@Glenn Maynard: I did read your answer. Your simplistic re-based character-grouping doesn't really hack it.
John Machin
@John Machin: The fact that it may not be able to cover more complex cases was the caveat stated at the very top of the answer. Downvoting an answer due to limitations which are *stated explicitly in the answer* is senseless.
Glenn Maynard
@Continuation: Don't forget the half-width katakana range, up around U+FF66. Note that it doesn't really make sense to split Japanese words per-character, e.g. splitting 欲しい into three separate characters, but handling that in general is a much harder problem.
Glenn Maynard
@Glenn Maynard: That was waffle, not a caveat. "You may well not need ...": he said he needs French etc.; your answer doesn't do it. FAIL.
John Machin
@John Machin: Sure it won't handle French ... if you selectively ignore part of the answer and don't actually flesh out the character groupings. Read as selectively as you like, if it makes you feel better.
Glenn Maynard
@John Machin: Thanks for pointing out the limitations of this solution. Can you suggest a solution that would cover French, German, etc?
Continuation
@John Machin: Regarding your question about Chinese Traditional and Chinese Simplified, I consider them the same language. Would this re-based solution work on both Traditional and Simplified Chinese?
Continuation
This solution does cover French, German, etc. That's explained in the answer and again above; John is just trying too hard to justify his downvote, and causing confusion as a result. Again, you need to fill in the groupings, e.g. replace `\w+` with `[a-zA-Z\u00C0-\u02AF...]+`, and so on.
Glenn Maynard
@Glenn - Yeah, I'm definitely confused. Can you elaborate on what you mean by "filling in the groupings"? Since you said that characters don't group nicely by language, how do I find out what groupings to replace `\w+` with? When you said `[a-zA-Z\u00C0-\u02AF...]+` "and so on", how do I find out what the "so on" is? Really appreciate your help.
Continuation
@Glenn: `\u00C0-\u02AF`? `\u00D7` and `\u00F7` (MULTIPLICATION SIGN, DIVISION SIGN) are letters?
John Machin
You need to look at the characters and decide whether you want them grouped as part of "words", treated individually (as in CJK characters), or ignored (spaces, maybe punctuation). Maybe you can generate regex classes with a script and Unicode data, or if you only need a few Western languages, just go over the ranges for those languages. Characters don't map uniquely to languages, but you can see which characters are used *by* a language by looking at the older charmaps, like ISO-8859-1. Sorry, but I can't do this for you.
Glenn Maynard