A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.

Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?

A: 

All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.

The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."

Jason Cohen
I suggested [:alpha:] in an answer I have deleted. I don't know C#, so I am probably wrong, but the regex engines I'm familiar with change the letters they match based on locale.
Jon Ericson
@Jon: .NET does not support [:name:] for named classes, but has an alternate syntax for the same purpose.
Richard
@Jason: You would only need to list them if your definition of a letter differed from Unicode's and character class subtraction was insufficient; e.g. [\p{L}-[\p{IsBasicLatin}]] would match all non-ASCII letters.
Richard
+3  A: 

What about \p{name} ?

Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing.

I don't know enough about Unicode, but maybe your characters fit a Unicode class?
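To get a feel for what names like Ll or Nd refer to, here is a minimal sketch in Python rather than C#. Python's built-in `re` module does not support `\p{name}`, so this looks up each character's Unicode general category with `unicodedata` instead. The helper names `is_letter` and `letters_only` are made up for illustration.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    """True if ch is in a Unicode letter category (Lu, Ll, Lt, Lm or Lo)."""
    return unicodedata.category(ch).startswith("L")

def letters_only(text: str) -> bool:
    """True if text is non-empty and every character is a Unicode letter.

    Note: accents stored as separate combining marks (category Mn)
    are not letters by this test.
    """
    return bool(text) and all(is_letter(ch) for ch in text)

print(unicodedata.category("å"))  # lowercase letter category
print(unicodedata.category("5"))  # decimal digit category
print(letters_only("Åsa"), letters_only("abc1"))
```

Like \p{L}, this is locale-independent: it accepts letters from any script, not just those of one language.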

Ray
No, this matches "all Unicode letters," which doesn't take locale into account, and locale is what he specifically asked about.
Jason Cohen
Yeah, I know. So I deleted and then changed my answer.
Ray
A timing thing I guess, my original answer did deserve a downvote.
Ray
+2  A: 

See character category selection with \p, and the Unicode semantics of \w.

MarkusQ
+12  A: 

You can use \pL to match any 'letter', which will support all letters in all languages. You can narrow it down to specific languages using 'named blocks'. More information can be found in the Character Classes documentation on MSDN.

My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
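A rough sketch of that idea, in Python rather than C#. The `LETTER_CLASS` table stands in for the localised resource, and its contents, along with the `name_pattern` and `is_valid` helpers, are assumptions for illustration only.

```python
import re

# Hypothetical per-locale "letter" classes. In a real application these
# strings would be loaded from localised resource files, not hard-coded.
LETTER_CLASS = {
    "en": "a-zA-Z",
    "sv": "a-zåäöA-ZÅÄÖ",
}

def name_pattern(locale: str) -> re.Pattern:
    """Form the larger pattern around the locale's 'letter' part."""
    letters = LETTER_CLASS[locale]
    return re.compile(f"[{letters}]+")

def is_valid(text: str, locale: str) -> bool:
    """True if text consists only of letters allowed for the locale."""
    return name_pattern(locale).fullmatch(text) is not None

print(is_valid("Åsa", "sv"))
print(is_valid("Åsa", "en"))
```

The rest of the pattern stays in code; only the letter class varies per locale, so adding a language means adding one resource entry.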

Richard Szalay
For those who are not so familiar with regex (like me): the braced form of \pL is \p{L}. Note that \p{Ll} matches lowercase letters only.
Run CMD
A: 

Is there a way to automatically know what chars are valid in a given locale/language or should I just make a blacklist of chars that I (think I) know I don't want?

This is not, in general, possible.

After all, English text does include some accented characters (e.g. "fête" and "naïve", which strictly correct UK English still writes with accents). In some languages, some of the standard letters are rarely used (e.g. y-diaeresis, ÿ, in French).

Then consider that foreign words are often included (this will often be the case where technical terms are used). Quotations would be another source.

If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic expertise in that language.
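To illustrate the point about accented English words, here is a small sketch in Python (not C#) showing that a plain ASCII letter class rejects them. The pattern name `ASCII_LETTERS` is just for this example.

```python
import re

# The "normal" letters-only class from the question.
ASCII_LETTERS = re.compile(r"[a-zA-Z]+")

for word in ["naive", "naïve", "fête"]:
    accepted = ASCII_LETTERS.fullmatch(word) is not None
    print(word, accepted)
```

Only the unaccented spelling passes, so even an "English" whitelist has to decide what to do with such words.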

Richard