ansaurus

Question

What would be a regex for valid xml names?

Answer 1

+5 A:

Do you mean XML element names? If so, no, that's too exclusive; there are lots of valid characters it doesn't cover. More in the spec here and here:

NameStartChar    ::=    ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
                        [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] |
                        [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
                        [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] |
                        [#xFDF0-#xFFFD] | [#x10000-#xEFFFF] 

NameChar         ::=    NameStartChar | "-" | "." | [0-9] | #xB7 |
                        [#x0300-#x036F] | [#x203F-#x2040] 

Name             ::=    NameStartChar (NameChar)*
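As a rough illustration, the productions above can be transcribed almost mechanically into a character-class regex. The sketch below uses Python's `re` module; the names `NAME_START`, `NAME_CHAR`, `NAME_RE` and `is_xml_name` are made up for this example and are not part of any XML library. It checks only the `Name` production, not namespaces or reserved names.

```python
import re

# Ranges copied from the NameStartChar production quoted above.
# Raw strings are used so that the re module itself interprets the \u escapes.
NAME_START = (
    r"A-Z_:a-z"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF"
    r"\u200C-\u200D\u2070-\u218F"
    r"\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)

# NameChar adds "-", ".", digits, #xB7 and two more ranges.
NAME_CHAR = NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

# Name ::= NameStartChar (NameChar)*
NAME_RE = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def is_xml_name(text: str) -> bool:
    """Return True if the whole string matches the Name production."""
    return NAME_RE.fullmatch(text) is not None
```

With this sketch, `is_xml_name("xml:schema")` and `is_xml_name("frère")` are both true, while `is_xml_name("1abc")` is false because a digit is not a `NameStartChar`.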

T.J. Crowder 2010-07-01 13:38:21

@Johannes: Good edit, thanks. And here *I'm* the one usually complaining people aren't quoting enough.

T.J. Crowder 2010-07-01 13:41:29

For namespace-aware XML parsers, the definition changes slightly: at most one ':' is allowed, and not at the beginning. See QName at http://www.w3.org/TR/REC-xml-names/#ns-qualnames for details.
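A hedged sketch of that namespace-aware variant, again in Python: it assumes the reading of the Namespaces spec in which a QName is either a single colon-free name or two colon-free names joined by one ':'. The identifiers `_NC_START`, `_NC_CHAR`, `QNAME_RE` and `is_qname` are invented for this example.

```python
import re

# NameStartChar ranges from XML 1.0 Fifth Edition with ":" left out,
# so neither part of the qualified name can contain a colon.
_NC_START = (
    r"A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
_NC_CHAR = _NC_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"
_NCNAME = f"[{_NC_START}][{_NC_CHAR}]*"

# Either "local" or "prefix:local"; both parts are colon-free names.
QNAME_RE = re.compile(f"{_NCNAME}(?::{_NCNAME})?")


def is_qname(text: str) -> bool:
    """Return True if the whole string looks like a qualified name."""
    return QNAME_RE.fullmatch(text) is not None
```

Under these assumptions `is_qname("xml:schema")` is true, while `is_qname(":a")` and `is_qname("a:b:c")` are false.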

Jörn Horstmann 2010-07-01 14:06:58

And if you think this is complex you should see what it was like *before* they (controversially) backported the XML 1.1 name character model to XML 1.0 Fifth Edition!

bobince 2010-07-01 14:19:25

Answer 2

A:

The page for XML schemas over at regular-expressions.info gives a good regex for matching XML names:

The regular expression \i\c* matches an XML name like xml:schema. In other regular expression flavors, you'd have to spell this out as [_:A-Za-z][-._:A-Za-z0-9]*. The latter regex also works with XML's regular expression flavor. It just takes more time to type in.

(The page gives a full explanation of how they work.)

Noldorin 2010-07-01 13:38:57

*"...you'd have to spell this out as `[_:A-Za-z][-._:A-Za-z0-9]*`..."* Surely regardless of regex flavor, that misses out quite a number of the valid characters?

T.J. Crowder 2010-07-01 13:40:36

Why the downvote? This is a pretty practical and simple implementation; I see nothing wrong with it in the vast majority of potential cases.

Noldorin 2010-07-01 14:36:55

@Noldorin: You don't think a French XML document will have an element called `frère` ("brother")? A German one an element called `gespräch` ("discussion")? A Swedish one with an element called `frågan` ("question")?
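The point is easy to check. The snippet below is only an illustration: it takes the ASCII-only pattern quoted in the answer and runs it against the names from this comment. The helper name `ascii_only_match` is made up for the example.

```python
import re

# The ASCII-only pattern quoted from regular-expressions.info in the answer.
ASCII_NAME_RE = re.compile(r"[_:A-Za-z][-._:A-Za-z0-9]*")


def ascii_only_match(name: str) -> bool:
    """Return True if the ASCII-only pattern accepts the whole name."""
    return ASCII_NAME_RE.fullmatch(name) is not None


# Print which of the example names the ASCII-only pattern accepts.
for name in ["xml:schema", "frère", "gespräch", "frågan"]:
    print(name, ascii_only_match(name))
```

Only `xml:schema` is accepted; the other three are rejected because `è`, `ä` and `å` fall outside `A-Za-z`, even though all three are valid name characters under the spec.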

T.J. Crowder 2010-07-01 16:14:12

I wasn't aware regex differentiated between characters with and without accents. Hasty assumption perhaps... downvote, nah.

Noldorin 2010-07-01 16:32:30

@Noldorin: Regexes are precise, for a reason. The answer was misleading when you posted it, it was pointed out to you, you did nothing to fix it. It promotes a mistaken and damaging misconception in direct response to a question about that misconception. A long time after it was pointed out to you, it was downvoted. It has nothing to do with "accented" letters. (You've at a stroke denied the Greeks use of XML almost entirely, for example.)

T.J. Crowder 2010-07-01 16:39:31

Answer 3

A:

EDIT:

.NET also has the method XmlConvert.VerifyName(string).

From Wikipedia:

Unicode characters in the following code point ranges are valid in XML 1.0 documents:

U+0009
U+000A
U+000D
U+0020–U+D7FF
U+E000–U+FFFD
U+10000–U+10FFFF

Unicode characters in the following code point ranges are always valid in XML 1.1 documents:

U+0001–U+0008
U+000B–U+000C
U+000E–U+001F
U+007F–U+0084
U+0086–U+009F

The preceding code points are contained in the following code point ranges which are only valid in certain contexts in XML 1.1 documents:

U+0001–U+D7FF
U+E000–U+FFFD
U+10000–U+10FFFF
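Note that these ranges describe which characters may appear anywhere in a document, which is a much wider set than the characters allowed in names. As a small sketch, the XML 1.0 list above could be turned into a check like the following; `is_valid_xml10_text` is a hypothetical helper name, and the code covers only the XML 1.0 ranges, not the XML 1.1 ones.

```python
import re

# Matches any single character OUTSIDE the XML 1.0 ranges listed above:
# U+0009, U+000A, U+000D, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML10_CHAR = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_valid_xml10_text(text: str) -> bool:
    """Return True if every character is allowed in an XML 1.0 document."""
    return _INVALID_XML10_CHAR.search(text) is None
```

For example, a string containing a tab or newline passes, while one containing U+0000 or U+000B does not.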

JohnB 2010-07-01 13:52:29

That simplest case is not getting you very far; I'd consider it harmful, in fact.

Joey 2010-07-01 16:24:20

@Johannes: ok I'll agree to that, but the purpose of that simple example was more in response to his initial RegEx expression (I thought it would be helpful to him), before you did such a great job of editing it :)

JohnB 2010-07-01 21:55:31
