I want to detect words in text, i.e. I need to know which characters in a given text are letters, that is they can be part of a (spoken) word and which are, on the other hand, punctuation and such.
For example, in the above sentence, "I", "want" and "i" and "e" are words in this regard, while spaces, "." and comma are not.
The difficulty in this is that I want to be able to read any kind of script that's based on Unicode. E.g., the german word "schön" is one word. But what about greek, arabic or japanese?
So, what I need is a table or list specifying all ranges of characters that can form words. Optionally, I also like to know which chars are digits that can form numbers (assuming other scripts have similar numbering schemes as the arabic numbers do).
I need this for Mac OS X, Windows and Linux. I'll write a C app, so it needs to be either a OS library or a complete code/data solution that I could translate into C.
I know that Mac OS (Cocoa) offers functions for this purpose, but I am not sure if there are similar solutions for Win and Linux (gtk based, probably?).
Alternatively, I could write my own code if I had the complete tables.
I have found the unicode charts (http://unicode.org/charts/index.html#scripts) but that's not coming in one convenient form I could use in programming.
So, can someone tell me if there are functions for Windows and Linux for this purpose, or where I can find a complete table/list of word characters in unicode?