Simple question:
What exactly does the word-character class \w match in C# / .NET?

My first thought was that it matches [A-Za-z0-9_] and the documentation tells me:

Character class    Description          Pattern     Matches
\w                 Matches any          \w          "I", "D", "A", "1", "3"
                   word character.                  in "ID A1.3"

which is not very helpful.
And \w seems to match äöü, too. What else? Is there a better (exact) definition available?

+4  A: 

Basically, it matches everything that fits the intuitive definition of a letter or digit in various scripts – plus the underscore and a few other oddballs.

You can find a complete list (at least for the BMP) with the following tiny PowerShell snippet:

0..65535 | ?{([char]$_) -match '\w'} | %{ "$_`: " + [char]$_ }
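For anyone who prefers to stay in C#, here is a rough equivalent of the same idea. It is only a sketch: the class and method names (WordCharCensus, CountByCategory) are made up for illustration, and instead of printing every character it tallies the matches by Unicode general category. The exact counts, and possibly the set of categories that show up, may vary with the runtime version, so none are quoted here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WordCharCensus
{
    // Tests every BMP code unit against \w (default options) and
    // counts the matches per Unicode general category.
    public static SortedDictionary<string, int> CountByCategory()
    {
        var word = new Regex(@"\w");
        var counts = new SortedDictionary<string, int>();

        for (int cp = 0; cp <= 0xFFFF; cp++)
        {
            char c = (char)cp;
            if (!word.IsMatch(c.ToString()))
                continue;

            string category = CharUnicodeInfo.GetUnicodeCategory(c).ToString();
            counts[category] = counts.TryGetValue(category, out int n) ? n + 1 : 1;
        }
        return counts;
    }

    public static void Main()
    {
        foreach (var pair in CountByCategory())
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

Grouping by category makes it easier to see which kinds of characters are covered than scrolling through tens of thousands of individual lines.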
Joey
+10  A: 

From the documentation:

Word Character: \w

\w matches any word character. A word character is a member of any of the Unicode categories listed in the following table.

  • Ll (Letter, Lowercase)
  • Lu (Letter, Uppercase)
  • Lt (Letter, Titlecase)
  • Lo (Letter, Other)
  • Lm (Letter, Modifier)
  • Nd (Number, Decimal Digit)
  • Pc (Punctuation, Connector)
    • This category includes ten characters, the most commonly used of which is the LOWLINE character (_), U+005F.

If ECMAScript-compliant behavior is specified, \w is equivalent to [a-zA-Z_0-9].
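To see the difference between the two behaviors, a minimal sketch like the following can help. The class and method names (WordCharDemo, IsWordDefault, IsWordEcma) are invented for this example; the only library pieces used are Regex.IsMatch, RegexOptions.ECMAScript and CharUnicodeInfo.GetUnicodeCategory.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WordCharDemo
{
    // Default behavior: \w is Unicode-aware.
    public static bool IsWordDefault(char c) =>
        Regex.IsMatch(c.ToString(), @"^\w$");

    // ECMAScript-compliant behavior: \w is documented as [a-zA-Z_0-9].
    public static bool IsWordEcma(char c) =>
        Regex.IsMatch(c.ToString(), @"^\w$", RegexOptions.ECMAScript);

    public static void Main()
    {
        foreach (char c in new[] { 'a', '7', '_', 'ä', '-' })
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            Console.WriteLine(
                $"{c} ({category}): default={IsWordDefault(c)}, ECMAScript={IsWordEcma(c)}");
        }
    }
}
```

With the sample characters above, 'ä' should be the only one where the two modes disagree: it is a lowercase letter (Ll), so it matches by default but falls outside [a-zA-Z_0-9]. If you need the ASCII-only behavior without switching the whole pattern to ECMAScript mode, writing the class [a-zA-Z0-9_] out explicitly is the straightforward alternative.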

polygenelubricants
D'oh ... I have to learn to read the right documentation ...
tanascius