Which regular expression can I use to match (allow) any kind of letter from any language

I need to match any letter including any diacritics (e.g. á, ü, ñ, etc.) and exclude any kind of symbol (math symbols, currency signs, dingbats, box-drawing characters, etc.) and punctuation characters.

I'm using ASP.NET MVC 2 with .NET 4. I've tried this annotation in my view model:

[RegularExpression(@"\p{L}*", ...

and this one:

[RegularExpression(@"\p{L}\p{M}*", ...

but client side validation does not work.

UPDATE: Thank you for all your answers. Your suggestions work, but only for .NET; the problem here is that the regex is also used for client-side validation with JavaScript (sorry if this was not clear enough). I had to go with:

[^0-9_\|°¬!#\$%/\()\?¡¿+{}[]:.\,;@ª^*<>=&]*

which is very ugly and does not cover all scenarios but is the closest thing to what I need.

A: 

\w - matches any alphanumeric character (including numbers)

In my tests it has matched:

  • ã
  • à
  • ç
  • 8
  • z

and hasn't matched:

  • ;
  • ,
  • \
  • :

If you know exactly what you want to exclude (say, a short list), you can do the following:

[^;,\\`.]

which matches, exactly once, any character that isn't:

  • ;
  • ,
  • \
  • `
  • .
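For the client side, here is a minimal JavaScript sketch of the same two ideas. It assumes the exclusion list is meant to contain a literal backslash (hence the doubled `\\` inside the class), and the `samples` array is made up for illustration. One caveat: JavaScript's \w is ASCII-only, so it should not be expected to match ã the way the .NET tests above did.

```javascript
// Negated character class: matches exactly one character that is NOT listed.
// The backslash is doubled so the class really contains a literal backslash.
const notExcluded = /^[^;,\\`.]$/;

// \w in JavaScript means [A-Za-z0-9_] only (no accented letters).
const wordChar = /^\w$/;

// Hypothetical sample characters, chosen to mirror the lists above.
const samples = ["a", "ã", "8", "_", ";", ",", "\\", "`", "."];

for (const ch of samples) {
  console.log(
    JSON.stringify(ch),
    "not excluded:", notExcluded.test(ch),
    "\\w:", wordChar.test(ch)
  );
}
```

The exclusion-list approach lets accented letters through, but it also lets through every symbol you forgot to list, which is its main weakness.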

Hope it helps!

MarceloRamires
`\w` will also match `_`
Senseful
@eagle hmm.. you're right, at least I've given an alternative. Gonna check it out though
MarceloRamires
\w - stands for Word. Not letter.
Lukas Šalkauskas
It also matches numbers which the OP does not want.
Tim Pietzcker
@Lukas: This is misleading. `\w` matches a single character, not a word. It will match letters, numbers and the underscore. Whether it matches only ASCII letters or Unicode letters varies between regex flavors - in .NET it's Unicode.
Tim Pietzcker
@Tim_Pietzcker I'm actually just learning REGEX, thank you, this was useful even for me =)
MarceloRamires
@Tim: [:word:] = \w = [A-Za-z0-9_] => Alphanumeric *characters* plus "_". (details here: http://en.wikipedia.org/wiki/Regular_expression#POSIX_Extended_Regular_Expressions)
Lukas Šalkauskas
@Lukas, yes of course, I'm aware of this. But in your initial comment it is easy to misunderstand you when you distinguish between "word" and "letter" - I guess 90% would think in this context that you're referring to the real-life meaning of word, i.e. a sequence of letters with semantic meaning.
Tim Pietzcker
A: 

You can find all the information you need about this kind of regex here. I hope it helps you beyond this particular case :)

Lukas Šalkauskas
A: 

\p{L}* should match "any kind of letter from any language". It should work; I used it in an i18n-proof uppercase/lowercase recognition regex in .NET.
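To illustrate why the question also tried \p{L}\p{M}*, here is a rough JavaScript sketch. It assumes an engine with ES2018 Unicode property escapes (the `u` flag), and the sample word is made up. A letter typed as base letter plus combining mark is not a single \p{L}, so the mark has to be allowed explicitly.

```javascript
// Letters only: every code point must be in Unicode category L.
const lettersOnly = /^\p{L}*$/u;

// Letters, each optionally followed by combining marks (category M).
const lettersWithMarks = /^(?:\p{L}\p{M}*)*$/u;

// Hypothetical sample: "mañana" with ñ as one precomposed code point...
const precomposed = "ma\u00F1ana";
// ...and the same word with ñ decomposed into n + U+0303 (combining tilde).
const decomposed = precomposed.normalize("NFD");

console.log(lettersOnly.test(precomposed));      // true
console.log(lettersOnly.test(decomposed));       // false: U+0303 is a mark, not a letter
console.log(lettersWithMarks.test(precomposed)); // true
console.log(lettersWithMarks.test(decomposed));  // true
```

Normalizing the input to NFC before validating would be another option, though not every base-plus-mark sequence has a precomposed form.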

Jan Willem B
Then the problem might be more specific than I thought, I'll update the question
pedro
+1  A: 

Ignore your grammar teacher and use double-negatives:

[^\W\d_]

Remember that \w matches any letter, digit, or underscore, so exclude the digits and the underscore as above. You might read it as “not not-a-word-character, not a digit, and not an underscore”, which leaves only letters. Apply De Morgan's theorem and it makes more sense: “a word-character but neither a digit nor an underscore.”
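A quick JavaScript sketch of the double negative, with made-up test strings. One assumption to flag loudly: the trick is only as Unicode-aware as the engine's \w. In JavaScript \w is ASCII-only, so here the class reduces to [A-Za-z]; the comments above say that in .NET \w is Unicode.

```javascript
// "Not a non-word character, not a digit, not an underscore."
// In JavaScript this is equivalent to [A-Za-z] because \w is ASCII-only.
const wordButNotDigitOrUnderscore = /^[^\W\d_]+$/;

console.log(wordButNotDigitOrUnderscore.test("abcXYZ")); // true
console.log(wordButNotDigitOrUnderscore.test("abc1"));   // false: contains a digit
console.log(wordButNotDigitOrUnderscore.test("a_b"));    // false: contains an underscore
console.log(wordButNotDigitOrUnderscore.test("ñ"));      // false in JavaScript: ñ counts as \W here
```

So the pattern reads nicely, but on the client side it would reject exactly the accented letters the question wants to allow.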

Greg Bacon
+4  A: 

You can use Char.IsLetter:

Indicates whether the specified Unicode character is categorized as a Unicode letter.

With .Net 4.0:

string onlyLetters = String.Concat(str.Where(Char.IsLetter));

On 3.5, String.Concat only accepts an array, so you should also call ToArray.
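For comparison, a rough client-side counterpart of the same "keep only the letters" idea, sketched in JavaScript. It assumes an engine with ES2018 Unicode property escapes; `onlyLetters` is a hypothetical helper name, not an existing API, and it works per code point rather than per UTF-16 char as Char.IsLetter does.

```javascript
// Hypothetical helper: drop every code point that is not a Unicode letter.
// \P{L} (capital P) is the negation of \p{L}; the u flag is required.
function onlyLetters(str) {
  return str.replace(/\P{L}/gu, "");
}

console.log(onlyLetters("Señor 123, ¿qué tal?")); // "Señorquétal"
```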

Kobi
+1 Better off with Char.IsLetter than regex :)
Christian
This doesn't answer the question. The question isn't necessarily about solving a problem; maybe it was asked in order to learn regex, I don't know. OK, it may be a real problem, but he specifically asks how to do this with regex (in the question body, a tag, and even the title), which is clearly achievable. +1 for solving the 'problem', -1 for not answering the question. Neutral.
MarceloRamires
That does not work "on the client side"
GvS
@Marcelo - Looking more closely at the question, you are probably right. `[` suggests this is used as an attribute, and possibly cannot be replaced by code.
Kobi
+1  A: 

One thing to watch out for is the client-side regex. The same pattern is run by the JavaScript regex engine on the client side and by the .NET regex engine on the server side. JavaScript's regex flavor won't support Unicode categories such as \p{L} in this scenario.
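A small JavaScript sketch of what goes wrong, with made-up inputs. It assumes the validator builds the RegExp from the pattern string without any flags, which I have not verified for MVC 2. Engines that implement ES2018 do understand \p{L}, but only when the `u` flag is set; without it the escape is not treated as a Unicode category.

```javascript
// Same pattern string, compiled with and without the u flag.
const pattern = "^\\p{L}*$";
const withFlag = new RegExp(pattern, "u"); // \p{L} = any Unicode letter
const withoutFlag = new RegExp(pattern);   // \p{L} is NOT a Unicode category here

console.log(withFlag.test("áüñ"));    // true
console.log(withFlag.test("Москва")); // true
console.log(withFlag.test("abc1"));   // false: digit
console.log(withFlag.test("€5"));     // false: currency sign and digit

// Without the flag, ordinary letters are rejected, so validation fails.
console.log(withoutFlag.test("abc")); // false
```

If you cannot control the flags the validation script uses, a hand-written character class like the one in the question's update may still be the practical fallback.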

Greg
A: 

Your problem is more likely that the pattern is not anchored: as written, the regex matches any input that contains at least one letter, whatever else is in it.

By adding ^ as a prefix and $ as a suffix, the whole string has to match your regex. So this probably works:

^\p{L}*$

Regexbuddy explains:

  1. ^ Assert position at beginning of the string
  2. \p{L} A character with the Unicode property 'letter' (any kind of letter from any kind of language)
     2a. Between zero and unlimited times, as many as possible (greedy)
  3. $ Assert position at the end of the string
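The effect of the anchors can be sketched in JavaScript like this (assuming an engine that supports the `u` flag; the inputs are made up):

```javascript
// Unanchored: succeeds as soon as one letter is found anywhere in the input.
const unanchored = /\p{L}/u;
// Unanchored with *: even succeeds on input with no letters (empty match).
const unanchoredStar = /\p{L}*/u;
// Anchored: every character from start to end must be a letter.
const anchored = /^\p{L}*$/u;

console.log(unanchored.test("abc123"));  // true
console.log(unanchoredStar.test("123")); // true
console.log(anchored.test("abc123"));    // false
console.log(anchored.test("abc"));       // true
console.log(anchored.test(""));          // true: * also allows the empty string
```

If empty input should be rejected, replacing * with + would be the usual adjustment.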
Jan Jongboom