How can I tell which Unicode characters are letters (words) versus being punctuation marks?

I want to detect words in text, i.e. I need to know which characters in a given text are letters (that is, they can be part of a spoken word) and which are punctuation and the like.

For example, in the above sentence, "I", "want", "i" and "e" are words in this regard, while spaces, "." and the comma are not.

The difficulty is that I want to be able to read any kind of script that Unicode covers. E.g., the German word "schön" is one word. But what about Greek, Arabic or Japanese?

So, what I need is a table or list specifying all ranges of characters that can form words. Optionally, I'd also like to know which chars are digits that can form numbers (assuming other scripts have numbering schemes similar to the Arabic numerals).

I need this for Mac OS X, Windows and Linux. I'll write a C app, so it needs to be either an OS library or a complete code/data solution that I could translate into C.

I know that Mac OS (Cocoa) offers functions for this purpose, but I am not sure if there are similar solutions for Win and Linux (gtk based, probably?).

Alternatively, I could write my own code if I had the complete tables.

I have found the Unicode charts (http://unicode.org/charts/index.html#scripts), but they don't come in a convenient form that I could use in programming.

So, can someone tell me if there are functions for Windows and Linux for this purpose, or where I can find a complete table/list of word characters in Unicode?

I don't think I can expect a standard regex to recognize Greek chars as letters via something like "\w". So I'd have to feed it all the possible letter codes one by one. But first I'd have to have this list.

Thomas Tempelmann 2010-02-11 23:06:42

There are regex engines (including, optionally, Python's) that implement the Unicode character database for `\w` et al. Some also have the richer `\p{...}` character class selector.

bobince 2010-02-11 23:40:31

Yes, see Python's docs for the `re` module. It has `re.UNICODE` which, as the docs say, will "Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database." http://docs.python.org/library/re.html
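As a rough illustration of the `re` approach mentioned above, here is a minimal sketch in Python 3. It assumes Python 3, where `str` patterns are Unicode-aware by default, so `re.UNICODE` is passed only to make the intent explicit. The helper name `find_words` and the sample text are made up for this example. The pattern `[^\W\d_]` means "a word character that is neither a decimal digit nor an underscore", which only approximates "letter": some numeric characters outside the decimal digits can still match.

```python
import re

# "A word character that is not a decimal digit and not an underscore".
# This approximates "letter"; it is not an exact test for the Unicode
# letter categories.
LETTER_RUN = re.compile(r"[^\W\d_]+", re.UNICODE)


def find_words(text):
    """Return the runs of letter-like characters found in text."""
    return LETTER_RUN.findall(text)


# Escapes are used so the sample does not depend on the file encoding:
# "sch\u00f6n" is the German word from the question, followed by four
# Greek letters, three CJK ideographs, and a number.
sample = "sch\u00f6n, \u03bb\u03b5\u03be\u03b7 \u65e5\u672c\u8a9e 42!"
words = find_words(sample)
print(words)
```

The comma, the spaces, the digits and the exclamation mark are skipped, while the German, Greek and Japanese runs each come back as one word. Combining marks are not word characters for this pattern, so text in decomposed form would need normalizing first (for example with `unicodedata.normalize("NFC", text)`).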

Craig McQueen 2010-02-12 04:15:22

I see. The regex lib I have access to by default does not support Unicode, but I could look into getting a better one.

Thomas Tempelmann 2010-02-12 13:54:48

Yes, that's the kind of table I was looking for. Now, this fileinfo site doesn't look very dependable. E.g., it doesn't state its sources (such as which Unicode version it is based on), or whether it's complete or not. How do I know this isn't just one bloke having collected what he needed for his own purposes and skipped all the ugly rest?

Thomas Tempelmann 2010-02-11 23:06:59

You'd have to parse the raw data files (http://www.unicode.org/Public/5.2.0/ucd/) in order to get the whole thing. Additionally, some languages such as Python already have it in a convenient (for them) format (http://docs.python.org/library/unicodedata.html).
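For what it's worth, the `unicodedata` module can also be used to generate such a table instead of parsing the raw files by hand. The following is only a sketch, assuming Python 3; the function names `is_letter`, `is_decimal_digit` and `letter_ranges` are invented for this example. It treats every code point whose general category starts with "L" as a letter and every "Nd" code point as a decimal digit, then collapses the letters into inclusive ranges that could be pasted into a C array.

```python
import unicodedata


def is_letter(ch):
    """True if the general category is one of the letter categories (L*)."""
    return unicodedata.category(ch).startswith("L")


def is_decimal_digit(ch):
    """True if the general category is Nd (decimal digit, in any script)."""
    return unicodedata.category(ch) == "Nd"


def letter_ranges(limit=0x110000):
    """Collect inclusive (first, last) ranges of letter code points below limit."""
    ranges = []
    start = None
    for cp in range(limit):
        if is_letter(chr(cp)):
            if start is None:
                start = cp          # a new run of letters begins here
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:           # a run that reaches the limit
        ranges.append((start, limit - 1))
    return ranges


# The table reflects the Unicode version bundled with the interpreter.
print("Unicode version:", unicodedata.unidata_version)

# Restricted to Latin-1 here to keep the output short.
for first, last in letter_ranges(0x100):
    print("{ 0x%04X, 0x%04X }," % (first, last))
```

Calling `letter_ranges()` without an argument walks the whole code space and takes noticeably longer. Whether combining marks (the "M" categories) should also count as parts of a word is a design decision this sketch leaves open.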

Ignacio Vazquez-Abrams 2010-02-11 23:17:38

Those generally deal well with the local 8-bit character set, not with Unicode text that may be in a variety of scripts and languages.

Adrian McCarthy 2010-02-11 23:20:00

The MSVC versions handle Unicode.

John Knoeller 2010-02-12 00:50:37
