I want to detect words in text, i.e. I need to know which characters in a given text are letters (that is, characters that can be part of a spoken word) and which are punctuation and the like.

For example, in the sentence above, "I", "want", "i" and "e" are words in this sense, while spaces, "." and "," are not.

The difficulty is that I want to be able to read any kind of script that Unicode covers. E.g., the German word "schön" is one word. But what about Greek, Arabic or Japanese?

So, what I need is a table or list specifying all ranges of characters that can form words. Optionally, I would also like to know which characters are digits that can form numbers (assuming other scripts have numbering schemes similar to the Arabic numerals).

I need this for Mac OS X, Windows and Linux. I'll write a C app, so it needs to be either an OS library or a complete code/data solution that I could translate into C.

I know that Mac OS (Cocoa) offers functions for this purpose, but I am not sure whether there are similar solutions for Windows and Linux (GTK-based, probably?).

Alternatively, I could write my own code if I had the complete tables.

I have found the Unicode charts (http://unicode.org/charts/index.html#scripts), but they don't come in a single convenient form that I could use in a program.

So, can someone tell me whether there are functions for Windows and Linux for this purpose, or where I can find a complete table/list of word characters in Unicode?

+1  A: 

If you are familiar with Python at all, the Natural Language Toolkit provides chunkers and lexical tools that will do this across languages. I'd pretend to be smart here and tell you more, but everything I know comes from this book, which I highly recommend. I realize you could code up a technical solution with a regex that would get you 80% of the way there, but why reinvent the wheel?

Tom
I don't think I can expect a standard regex to recognize Greek characters as letters via something like "\w". So I'd have to feed it all the possible letter codes one by one. But first I'd need to have this list.
Thomas Tempelmann
There are regex engines (including, optionally, Python's) that use the Unicode character database for `\w` et al. Some also have the richer `\p{...}` character class selector.
bobince
Yes, see the Python documentation for the `re` module. It has `re.UNICODE` which, as it says, "Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database." http://docs.python.org/library/re.html
Craig McQueen
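To illustrate the comments above, here is a minimal sketch of Unicode-aware matching with Python's `re` module. It assumes Python 3, where `str` patterns are already Unicode-aware and `re.UNICODE` is redundant; the flag is spelled out only to mirror the quoted documentation. The helper names `find_words` and `find_numbers` are made up for this example.

```python
import re

# \w covers letters, digits and underscore; subtracting \d and "_"
# leaves (roughly) the letters of any script.
WORD_RE = re.compile(r"[^\W\d_]+", re.UNICODE)
# \d covers decimal digits of any script, not just 0-9.
NUMBER_RE = re.compile(r"\d+", re.UNICODE)


def find_words(text):
    """Return all runs of letter characters in text."""
    return WORD_RE.findall(text)


def find_numbers(text):
    """Return all runs of decimal digits in text."""
    return NUMBER_RE.findall(text)


if __name__ == "__main__":
    sample = "sch\u00f6n, \u03b1\u03b2\u03b3! 42"
    print(find_words(sample))
    print(find_numbers(sample))
```

Whether a regex library available from C behaves the same way depends on that library; this only shows the behaviour of Python's engine.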
I see. The regex library I have access to by default does not support Unicode, but I could look into getting a better one.
Thomas Tempelmann
+3  A: 

You can try to use the Unicode character category to figure out what the word separators may be, but be aware that some languages (e.g. Japanese) do not even have word separators.

Ignacio Vazquez-Abrams
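As a rough sketch of the category-based approach, the following uses Python's `unicodedata.category()` to classify characters and split text into words. The grouping is an assumption made for this example (combining marks are counted as part of a word, and only category `Nd` counts as a digit), and the helpers `classify` and `split_words` are hypothetical names.

```python
import unicodedata


def classify(ch):
    """Map one character to a coarse class via its general category."""
    cat = unicodedata.category(ch)
    if cat.startswith("L") or cat.startswith("M"):
        return "letter"       # letters, plus combining marks (assumption)
    if cat == "Nd":
        return "digit"        # decimal digits of any script
    if cat.startswith("P"):
        return "punctuation"
    if cat.startswith("Z"):
        return "separator"    # spaces, line and paragraph separators
    return "other"


def split_words(text):
    """Collect maximal runs of 'letter' characters."""
    words, current = [], []
    for ch in text:
        if classify(ch) == "letter":
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

For text without word separators, such as Japanese, `split_words` returns one long run rather than individual words, which is the limitation mentioned above.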
Yes, that's the kind of table I was looking for. However, this fileinfo site doesn't look very dependable. E.g., it doesn't state its sources (such as which Unicode version it is based on) or whether it's complete. How do I know this isn't just one bloke having collected what he needed for his own purposes and skipped all the ugly rest?
Thomas Tempelmann
You'd have to parse the raw data files (http://www.unicode.org/Public/5.2.0/ucd/) in order to get the whole thing. Additionally, some languages such as Python already have it in a convenient (for them) format (http://docs.python.org/library/unicodedata.html).
Ignacio Vazquez-Abrams
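As a sketch of what parsing the raw data could look like: each line of `UnicodeData.txt` is a semicolon-separated record whose first three fields are the code point (hex), the character name and the general category. Large blocks are given as a pair of `<..., First>` and `<..., Last>` lines. The code below turns such lines into merged ranges of letter code points. The `SAMPLE` lines were typed in by hand for illustration only, and the function names are hypothetical; a real run would read the downloaded file instead.

```python
SAMPLE = """\
002C;COMMA;Po;0;CS;;;;;N;;;;;
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
""".splitlines()


def parse_unicode_data(lines):
    """Yield (first, last, category) for each record or First/Last pair."""
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = code            # remember the start of the block
        elif name.endswith(", Last>"):
            yield range_start, code, category
            range_start = None
        else:
            yield code, code, category


def letter_ranges(lines):
    """Merge adjacent letter code points (categories L*) into ranges.

    Assumes the input is sorted by code point, as the real file is.
    """
    ranges = []
    for first, last, category in parse_unicode_data(lines):
        if not category.startswith("L"):
            continue
        if ranges and ranges[-1][1] + 1 == first:
            ranges[-1] = (ranges[-1][0], last)   # extend the previous range
        else:
            ranges.append((first, last))
    return ranges
```

The resulting list of `(first, last)` pairs could then be written out as a sorted C array and searched with a binary search, which is one way to get the kind of table the question asks for.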
A: 

The C runtime has:

  • ispunct() tests for a punctuation character.
  • iscntrl() tests for a control character.
John Knoeller
Those generally deal well with the local 8-bit character set, not with Unicode text that may be in a variety of scripts and languages.
Adrian McCarthy
The MSVC versions handle Unicode.
John Knoeller