Is there a list of language only character regions for UTF-8 somewhere?

+3 Q:

I'm trying to analyze some UTF-8 encoded documents in a way that recognizes characters from different languages. For my approach to work I need to ignore non-language characters, such as control characters, mathematical symbols, etc. Just trying to dissect the basic Latin section of the Unicode standard has resulted in multiple regions, with characters like the division symbol sitting right in the middle of a range of valid Latin characters.

Is there a list somewhere that identifies these regions? Or better yet, a Regex that defines the regions or something in C# that can identify the different characters?

+5 A:

Look at the Unicode character categories. You can match these in C# regular expressions with the character class syntax \p{catname}. So to match a lower-case letter, you would use \p{Ll}. You can combine these: [\p{Ll}\p{Lu}] matches characters in either the Ll or the Lu category.
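As a rough illustration of the category syntax, here is a minimal sketch that uses \p{L} (the union of all letter categories) and its negation \P{L}. The class and method names (LetterFilter, LetterRuns, StripNonLetters) are made up for this example. Note that .NET regular expressions work on UTF-16 code units, so letters outside the Basic Multilingual Plane and combining marks (category M) are not treated as letters here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LetterFilter
{
    // \p{L} covers every letter category: Lu, Ll, Lt, Lm and Lo.
    private static readonly Regex Letters = new Regex(@"\p{L}+");

    // \P{L} (capital P) is the negation: any character that is not a letter.
    private static readonly Regex NonLetters = new Regex(@"\P{L}+");

    // Returns each run of consecutive letters found in the input.
    public static string[] LetterRuns(string text)
    {
        MatchCollection matches = Letters.Matches(text);
        var runs = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            runs[i] = matches[i].Value;
        }
        return runs;
    }

    // Removes every non-letter character from the input.
    public static string StripNonLetters(string text)
    {
        return NonLetters.Replace(text, "");
    }
}
```

With this sketch, LetterFilter.StripNonLetters("a÷b") drops the division sign, because ÷ is in the math symbol category (Sm) rather than a letter category.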

Matthew Flaschen 2010-05-17 03:21:28

+1 A:

You can use \p{XXX} to match a Unicode category. For example, \p{Cc} matches all control characters.

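I guess you can use \w to match all letters (the L* categories), but note that it matches more than letters. In .NET it is equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}] in Unicode mode (the default), so it also matches non-spacing marks, decimal digits and connector punctuation such as the underscore.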
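Because \w also covers decimal digits (Nd) and connector punctuation (Pc), it is broader than "letters only". The sketch below compares \w against an explicit letter-category check for single characters. The class and method names (WordCharCheck, MatchesWordClass, IsLetterCategory) are assumptions made for illustration; CharUnicodeInfo.GetUnicodeCategory and the UnicodeCategory enum come from System.Globalization.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WordCharCheck
{
    // True when the regex class \w matches the single character c.
    public static bool MatchesWordClass(char c)
    {
        return Regex.IsMatch(c.ToString(), @"\A\w\z");
    }

    // True when c belongs to one of the five letter categories.
    public static bool IsLetterCategory(char c)
    {
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.UppercaseLetter:  // Lu
            case UnicodeCategory.LowercaseLetter:  // Ll
            case UnicodeCategory.TitlecaseLetter:  // Lt
            case UnicodeCategory.ModifierLetter:   // Lm
            case UnicodeCategory.OtherLetter:      // Lo
                return true;
            default:
                return false;
        }
    }
}
```

For '7' and '_' the first method returns true and the second returns false, which is the difference that matters if digits and underscores should be ignored.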

See http://www.fileformat.info/info/unicode/category/index.htm for a list of categories.

J-16 SDiZ 2010-05-17 03:25:06

+1 A:

You might be interested in universal alpha as defined by what's legal in a C identifier.

BCS 2010-05-17 13:09:51
