[a-zA-Z_:]([a-zA-Z0-9_:.])*

Would this do?

+5  A: 

Do you mean XML element names? If so, no: that pattern is too restrictive, and there are lots of valid characters it doesn't cover. More in the spec here and here:

NameStartChar    ::=    ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
                        [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] |
                        [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
                        [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] |
                        [#xFDF0-#xFFFD] | [#x10000-#xEFFFF] 

NameChar         ::=    NameStartChar | "-" | "." | [0-9] | #xB7 |
                        [#x0300-#x036F] | [#x203F-#x2040] 

Name             ::=    NameStartChar (NameChar)* 
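
As a rough sketch, the productions above can be transcribed into a Python regex. The names `NAME_START_CHAR`, `NAME_CHAR`, `XML_NAME` and `is_xml_name` are made up for this example; only the character ranges come from the spec text quoted above.

```python
import re

# Ranges transcribed from the NameStartChar production quoted above.
NAME_START_CHAR = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u2070-\u218F"
    "\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)

# NameChar adds "-", ".", digits, U+00B7 and two combining/punctuation ranges.
NAME_CHAR = NAME_START_CHAR + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

# Name ::= NameStartChar (NameChar)*
XML_NAME = re.compile(f"[{NAME_START_CHAR}][{NAME_CHAR}]*")


def is_xml_name(s):
    """Return True if the whole string matches the Name production."""
    return XML_NAME.fullmatch(s) is not None


print(is_xml_name("xml:schema"))  # True
print(is_xml_name("frère"))       # True
print(is_xml_name("1abc"))        # False: a digit cannot start a name
```

This only checks the Name production itself; it says nothing about namespace rules or reserved names.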
T.J. Crowder
@Johannes: Good edit, thanks. And here *I'm* the one usually complaining people aren't quoting enough.
T.J. Crowder
For namespace-aware XML parsers, the definition changes slightly: at most one ':' is allowed, and not at the beginning. See QName at http://www.w3.org/TR/REC-xml-names/#ns-qualnames for details.
Jörn Horstmann
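
A minimal sketch of that namespace-aware variant, assuming QName can be read as an NCName optionally followed by ':' and a second NCName, where NCName is a Name with no colon. The identifiers (`_NC_START`, `_NC_CHAR`, `QNAME`, `is_qname`) are hypothetical; the ranges are the spec's name-character ranges with ':' removed.

```python
import re

# NameStartChar ranges from the XML spec, minus ":" (an NCName has no colon).
_NC_START = (
    "A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
_NC_CHAR = _NC_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"
_NCNAME = f"[{_NC_START}][{_NC_CHAR}]*"

# Either a bare local name, or prefix ":" local name.
QNAME = re.compile(f"{_NCNAME}(?::{_NCNAME})?")


def is_qname(s):
    """Return True if the whole string looks like a qualified name."""
    return QNAME.fullmatch(s) is not None


print(is_qname("xml:schema"))  # True
print(is_qname(":schema"))     # False: colon at the beginning
print(is_qname("a:b:c"))       # False: more than one colon
```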
And if you think this is complex you should see what it was like *before* they (controversially) backported the XML 1.1 name character model to XML 1.0 Fifth Edition!
bobince
A: 

The page for XML schemas over at regular-expressions.info gives a good regex for matching XML names:

The regular expression \i\c* matches an XML name like xml:schema. In other regular expression flavors, you'd have to spell this out as [_:A-Za-z][-._:A-Za-z0-9]*. The latter regex also works with XML's regular expression flavor. It just takes more time to type in.

(The page gives a full explanation of how they work.)

Noldorin
*"...you'd have to spell this out as `[_:A-Za-z][-._:A-Za-z0-9]*`..."* Surely regardless of regex flavor, that misses out quite a number of the valid characters?
T.J. Crowder
Downvote, why? This is a pretty practical and simple implementation; I see nothing wrong with it in the vast majority of potential cases.
Noldorin
@Noldorin: You don't think a French XML document will have an element called `frère` ("brother")? A German one an element called `gespräch` ("discussion")? A Swedish one with an element called `frågan` ("question")?
T.J. Crowder
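
To make that objection concrete, here is a small sketch that runs the ASCII-only pattern from the answer against the element names from the comment. The helper name `ascii_only_matches` is made up for illustration.

```python
import re

# The ASCII-only pattern quoted in the answer.
ASCII_NAME = re.compile(r"[_:A-Za-z][-._:A-Za-z0-9]*")


def ascii_only_matches(name):
    """Return True if the ASCII-only pattern accepts the whole name."""
    return ASCII_NAME.fullmatch(name) is not None


for name in ["xml:schema", "frère", "gespräch", "frågan"]:
    print(name, ascii_only_matches(name))
# xml:schema True
# frère False
# gespräch False
# frågan False
```

All three non-ASCII names are valid XML names under the spec's Name production, yet the pattern rejects them.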
I wasn't aware regex differentiated between characters with and without accents. Hasty assumption perhaps... downvote, nah.
Noldorin
@Noldorin: Regexes are precise, for a reason. The answer was misleading when you posted it; that was pointed out to you, and you did nothing to fix it. It promotes a mistaken and damaging misconception in direct response to a question about that misconception. A long time after it was pointed out to you, it was downvoted. It has nothing to do with "accented" letters. (You've at a stroke denied the Greeks use of XML almost entirely, for example.)
T.J. Crowder
A: 

EDIT:

.NET also has the method XmlConvert.VerifyName(string).

From Wikipedia:

Unicode characters in the following code point ranges are valid in XML 1.0 documents:

  • U+0009
  • U+000A
  • U+000D
  • U+0020–U+D7FF
  • U+E000–U+FFFD
  • U+10000–U+10FFFF
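
As a sketch, the XML 1.0 ranges listed above translate directly into a code point check. Note that this tests whether a character may appear in a document at all, not whether it is allowed in a name. The function names are hypothetical.

```python
def is_xml10_char(ch):
    """Return True if the character falls in the XML 1.0 ranges listed above."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_invalid_xml10_char(text):
    """Return the index of the first character outside those ranges, or None."""
    for i, ch in enumerate(text):
        if not is_xml10_char(ch):
            return i
    return None


print(is_xml10_char("\t"))                 # True
print(is_xml10_char("\x0b"))               # False: vertical tab is not allowed
print(first_invalid_xml10_char("ab\x01c")) # 2
```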

Unicode characters in the following code point ranges are always valid in XML 1.1 documents:

  • U+0001–U+D7FF
  • U+E000–U+FFFD
  • U+10000–U+10FFFF

The preceding ranges contain the following code point ranges, which are only valid in certain contexts in XML 1.1 documents:

  • U+0001–U+0008
  • U+000B–U+000C
  • U+000E–U+001F
  • U+007F–U+0084
  • U+0086–U+009F
JohnB
That simplest case is not getting you very far; I'd consider it harmful, in fact.
Joey
@Johannes: OK, I'll agree to that, but the purpose of that simple example was more a response to his initial regex (I thought it would be helpful to him), before you did such a great job of editing it :)
JohnB