I'm trying to analyze some UTF-8 encoded documents in a way that recognizes characters from different languages. For my approach to work I need to ignore non-language characters, such as control characters, mathematical symbols etc. Just trying to dissect the basic Latin section of the Unicode standard has resulted in multiple regions, with characters like the division symbol sitting right in the middle of a range of valid Latin characters.

Is there a list somewhere that identifies these regions? Or better yet, a Regex that defines the regions or something in C# that can identify the different characters?

+5  A: 

Look at the Unicode character categories. You can match these in C# regular expressions with the character class syntax \p{catname}. So to match a lower-case letter, you would use \p{Ll}. You can combine these. [\p{Ll}\p{Lu}] matches characters in either the Ll or Lu class.
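As a minimal sketch of this approach: the snippet below uses \P{L} (capital P negates the class) to replace every run of non-letter characters with a space. The class and method names are hypothetical, chosen only for illustration. It assumes precomposed text; combining marks (category Mn) are not letters and would be dropped. .NET regexes work on UTF-16 code units, so letters outside the Basic Multilingual Plane would likely need separate handling.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LetterFilter
{
    // \p{L} covers every letter category (Lu, Ll, Lt, Lm, Lo);
    // \P{L} is its negation, so this matches runs of non-letters.
    private static readonly Regex NonLetters = new Regex(@"\P{L}+");

    // Hypothetical helper: replaces each run of non-letter characters
    // with a single space and trims the ends.
    public static string KeepLetters(string text)
    {
        return NonLetters.Replace(text, " ").Trim();
    }
}
```

Calling LetterFilter.KeepLetters on a string that mixes letters with the division sign, digits and tabs leaves only the letters, separated by single spaces.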

Matthew Flaschen
+1  A: 

You can use \p{XXX} to match a Unicode category. For example, \p{Cc} matches all control characters.

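You can also use \w, but note that it matches more than the letters in (L*). In .NET it is equal to [\p{L}\p{Mn}\p{Nd}\p{Pc}], so it also accepts decimal digits and connector punctuation such as the underscore. To match letters only, use \p{L}.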

See http://www.fileformat.info/info/unicode/category/index.htm for a list of categories.
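To illustrate the difference between \w and a letters-only class, here is a small sketch. The class and method names are hypothetical. It also shows CharUnicodeInfo.GetUnicodeCategory, which reports the category of a single char without a regex.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CategoryDemo
{
    // True when the whole string consists of word characters (\w),
    // which includes digits and the underscore as well as letters.
    public static bool IsWord(string s) => Regex.IsMatch(s, @"^\w+$");

    // True when the whole string consists of letters only (\p{L}).
    public static bool IsLetters(string s) => Regex.IsMatch(s, @"^\p{L}+$");

    // Returns the Unicode general category of one UTF-16 code unit.
    public static UnicodeCategory CategoryOf(char c) =>
        CharUnicodeInfo.GetUnicodeCategory(c);
}
```

With these helpers, "abc_123" passes IsWord but fails IsLetters, and CategoryOf reports the division sign (U+00F7) as a math symbol rather than a letter.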

J-16 SDiZ
+1  A: 

You might be interested in universal alpha as defined by what's legal in a C identifier.

BCS