Regular expression to catch letters beyond a-z

tags:

c#
regex

+6 Q:

Regular expression to catch letters beyond a-z

A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.

Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?

All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.

The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."

Jason Cohen 2009-03-17 21:46:11

I suggested [:alpha:] in an answer I have deleted. I don't know C#, so I am probably wrong, but the regex engines I'm familiar with change the letters they match based on locale.

Jon Ericson 2009-03-17 21:52:32

@Jon: .NET does not support [:name:] for named classes, but has an alternate syntax for the same purpose.

Richard 2009-03-18 11:27:28

@Jason: You would only need to list them if your definition of letter differed from Unicode's and Character Class Subtraction was insufficient; e.g. [\p{L}-[\p{IsBasicLatin}]] would match all non-ASCII letters.

Richard 2009-03-18 11:29:51
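
To make the subtraction syntax from the comment above concrete, here is a minimal sketch. The class and method names are made up for illustration, and it assumes the input uses precomposed characters (e.g. "ö" as a single code point rather than "o" plus a combining mark).

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class NonAsciiLetters
{
    // \p{L} is any Unicode letter; -[\p{IsBasicLatin}] subtracts the
    // U+0000..U+007F block, leaving only letters outside ASCII.
    private static readonly Regex NonAsciiLetter =
        new Regex(@"[\p{L}-[\p{IsBasicLatin}]]");

    // Returns the non-ASCII letters of the input, in order of appearance.
    public static string Extract(string input)
    {
        var result = new StringBuilder();
        foreach (Match m in NonAsciiLetter.Matches(input))
        {
            result.Append(m.Value);
        }
        return result.ToString();
    }
}
```

For "Smörgåsbord 123" this should pick out only the "ö" and the "å"; the ASCII letters, the space and the digits are skipped.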

+3 A:

What about \p{name} ?

Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing.

I don't know enough about Unicode, but maybe your characters fit a Unicode class?
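
A small sketch of how a few of those names behave in .NET, in case it helps. The helper names are invented for the example; the patterns use only the \p{name} syntax described above, anchored with \A and \z so the whole string has to match.

```csharp
using System.Text.RegularExpressions;

public static class NamedClasses
{
    // L: any Unicode letter, upper or lower case, in any script.
    public static bool IsOnlyLetters(string s) => Regex.IsMatch(s, @"\A\p{L}+\z");

    // Ll: lowercase letters only.
    public static bool IsOnlyLowercase(string s) => Regex.IsMatch(s, @"\A\p{Ll}+\z");

    // Nd: decimal digits.
    public static bool IsOnlyDigits(string s) => Regex.IsMatch(s, @"\A\p{Nd}+\z");

    // IsGreek: a named block (a code point range), not a category.
    public static bool IsOnlyGreek(string s) => Regex.IsMatch(s, @"\A\p{IsGreek}+\z");
}
```

Note that none of these look at the current culture: "åäö" passes IsOnlyLetters on any machine, Swedish or not.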

Ray 2009-03-17 21:47:14

No, these are "all Unicode letters" which doesn't take locale into account, which he specifically asked for.

Jason Cohen 2009-03-17 21:49:19

Yeah, I know. So I deleted and then changed my answer.

Ray 2009-03-17 21:50:20

A timing thing I guess, my original answer did deserve a downvote.

Ray 2009-03-17 21:54:08

+2 A:

See character categories selection with \p and \w Unicode semantics.

MarkusQ 2009-03-17 21:50:01

+12 A:

You can use \pL to match any 'letter', which will support all letters in all languages. You can narrow it down to specific languages using 'named blocks'. More information can be found in the Character Classes documentation on MSDN.

My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
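
A rough sketch of that idea, under some assumptions: the dictionary below is only a stand-in for a real localised resource file, the class and method names are hypothetical, and the per-language letter sets are whatever you decide they should be. Languages without an entry fall back to "any Unicode letter".

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LocalizedLetters
{
    // Hypothetical stand-in for a localised resource:
    // two-letter language code -> the "letter" part of the pattern.
    private static readonly Dictionary<string, string> LetterClasses =
        new Dictionary<string, string>
        {
            { "en", "[a-zA-Z]" },
            { "sv", "[a-zåäöA-ZÅÄÖ]" },
        };

    // Builds a whole-string "letters only" pattern for the given language code.
    public static Regex WordPattern(string languageCode)
    {
        string letter;
        if (!LetterClasses.TryGetValue(languageCode, out letter))
        {
            // No entry for this language: accept any Unicode letter.
            letter = @"\p{L}";
        }
        // Form the larger pattern around the localised "letter" part.
        return new Regex(@"\A" + letter + @"+\z");
    }
}
```

The language code could come from something like CultureInfo.CurrentCulture.TwoLetterISOLanguageName. With the entries above, WordPattern("sv") should accept "räksmörgås" while WordPattern("en") rejects it.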

Richard Szalay 2009-03-17 21:51:17

For those who are not so familiar with regex (like me): .NET needs the braces, so the code that actually works is \p{L} (\p{Ll} would match lowercase letters only).

Run CMD 2010-02-11 15:29:22

Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?

This is not, in general, possible.

After all, English text does include some accented characters (e.g. in "fête" and "naïve", which in UK English, to be strictly correct, still use accents). In some languages some of the standard letters are rarely used (e.g. y-diaeresis in French).

Then consider that foreign words are often included (this will often be the case where technical terms are used). Quotations would be another source.

If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic experience in that language.

Richard 2009-03-18 11:38:02
