




I'm trying to write a regex replace that turns arbitrary input into a valid XML tag name, meaning: no spaces, no special characters, lowercase only, and so on.

Is there a common approach to that or do I have to do it all from scratch?


string Invalid = "asd(%4 asKUd n!%mn &§a_As1";  // Invalid as a tag

string Valid = FormatToSafeXmlTag(Invalid);  // How to write this function?

// Valid = "asd4_askud_nmna_as1"
+2  A: 
  • Only lowercase: ^[a-z]+$
  • First char lowercase, (optional) remaining chars lowercase/numbers: ^[a-z][a-z0-9]*$
  • Only uppercase: ^[A-Z]+$
  • First char alphabet, (optional) remaining chars alphanumeric ^[a-zA-Z][a-zA-Z0-9]*$
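
To see what these anchored patterns accept, it can help to run them against a few sample names. The following is only an illustrative sketch: the `patterns` object, its key names and the `matchingPatterns` helper are made up for this example.

```javascript
// Hypothetical labels for the four patterns listed above.
const patterns = {
  lowercaseOnly: /^[a-z]+$/,
  lowercaseThenDigits: /^[a-z][a-z0-9]*$/,
  uppercaseOnly: /^[A-Z]+$/,
  alphaThenAlnum: /^[a-zA-Z][a-zA-Z0-9]*$/,
};

// Returns the labels of all patterns that match the whole input.
function matchingPatterns(input) {
  return Object.keys(patterns).filter((name) => patterns[name].test(input));
}

console.log(matchingPatterns("abc")); // lowercaseOnly, lowercaseThenDigits, alphaThenAlnum
console.log(matchingPatterns("a1"));  // lowercaseThenDigits, alphaThenAlnum
console.log(matchingPatterns("1a"));  // empty: a digit may not come first
```

Because the trailing part uses `*`, the second and fourth patterns also accept single-character names such as `a`.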

EDIT: To strip out everything but lowercase characters in JavaScript:

str = str.replace(/[^a-z]/g, "");

The catch is when the user enters nothing but unacceptable characters: you will end up trying to create an XML tag from an empty string. I'd rather ask the user to try again. How hard can it be to enter a lowercase string?

CAUTION: Another edge case is when the user enters xml or any case-insensitive variant of it (thanks to @Tim's answer). If you are using JavaScript, you may not be able to use the solution suggested by Tim as-is: it uses lookbehind, which JavaScript regexes only support in engines that implement ES2018 or later.

JavaScript code:

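str = str.replace(/\s/g, "_"); // replace each whitespace character with an underscore

str = str.replace(/[^a-zA-Z0-9_\-]/g, ""); // strip all remaining symbols

var reg = /^xml/i; // matches "xml", "XmL", ... at the start of the name

if (str.length == 0 || reg.test(str)) // is it empty, or does it start with "xml" in any case?
    alert("invalid tag name");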
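
Putting the steps above together, the function asked for in the question might look roughly like this. It is a minimal sketch under a few assumptions of mine: ASCII-only names, hyphens dropped for simplicity, leading digits removed (a name may not start with a digit), and an exception instead of an alert. `formatToSafeXmlTag` is a hypothetical helper that only borrows its name from the question.

```javascript
// Hypothetical helper; the name mirrors the function asked for in the question.
function formatToSafeXmlTag(input) {
  const cleaned = input
    .trim()
    .replace(/\s+/g, "_")          // each run of whitespace becomes one underscore
    .replace(/[^a-zA-Z0-9_]/g, "") // keep only ASCII letters, digits and underscores
    .toLowerCase()
    .replace(/^[0-9]+/, "");       // a name may not start with a digit

  if (cleaned.length === 0) {
    throw new Error("Input contains no characters usable in a tag name");
  }
  if (/^xml/.test(cleaned)) {
    throw new Error("Tag names starting with 'xml' are reserved");
  }
  return cleaned;
}
```

With the sample input from the question, `formatToSafeXmlTag("asd(%4 asKUd n!%mn &§a_As1")` returns `asd4_askud_nmn_a_as1`. Every run of whitespace becomes an underscore, so the result has one more underscore than the sample output in the question.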
Are single character tags OK? (Your second and last examples will only match tags with multiple characters)
Correct, no doubts - but I don't want to create error messages, so if any invalid characters are found, they'd just be stripped out.
@Dexter you spotted it while I was editing it.
@ApoY2k What if user enters only invalid characters?
Well, okay in that case you could throw an exception. But that's the only case.
see my edit - assuming you are on javascript.
+6  A: 

According to the XML specification, an element's name is formed in the following way:

Name   ::=  NameStartChar (NameChar)*


NameStartChar  ::=  ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] 
  | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] 
  | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] 
  | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar       ::=  NameStartChar | "-" | "." | [0-9] | #xB7 
  | [#x0300-#x036F] | [#x203F-#x2040]

Which is trivial to convert to a regular expression.

If you're looking to remove any character outside of this definition, simply invert the characters the expression is looking for.
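
As a rough illustration of that conversion, here is a sketch in JavaScript. It assumes an engine that supports the `u` flag and `\u{...}` escapes (ES2015 or later). The constant and function names are invented for this example; the character ranges are copied from the grammar above.

```javascript
// Character ranges copied from the NameStartChar production quoted above.
const NAME_START =
  ":A-Z_a-z\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}\\u{370}-\\u{37D}" +
  "\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}" +
  "\\u{3001}-\\u{D7FF}\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}";

// NameChar adds "-", ".", digits, #xB7 and two further ranges.
const NAME_CHAR = NAME_START + "\\-.0-9\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}";

const xmlName = new RegExp(`^[${NAME_START}][${NAME_CHAR}]*$`, "u");
const notNameChar = new RegExp(`[^${NAME_CHAR}]`, "gu");            // inverted NameChar
const notNameStartPrefix = new RegExp(`^[^${NAME_START}]+`, "u");   // inverted NameStartChar

// True if the whole string matches the Name production.
function isXmlName(s) {
  return xmlName.test(s);
}

// Removes every character outside NameChar, then any leading characters
// that are not allowed to start a name. The result may be empty.
function stripToXmlName(s) {
  return s.replace(notNameChar, "").replace(notNameStartPrefix, "");
}

console.log(isXmlName("my-tag.1"));        // true
console.log(isXmlName("1tag"));            // false
console.log(stripToXmlName("1 my tag!"));  // "mytag"
```

Note that the grammar alone does not lowercase anything or reject the empty string, so the caller still has to handle those cases.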

+1  A: 

XML tags (I assume you're asking about tag names) have to follow these rules:

  • start with a letter, colon or underscore
  • only contain letters, digits, dots, hyphens, underscores or colons (colons are used for namespaces)
  • must not start with xml

Therefore, a regex for valid tag names could be:


depending on your regex flavor (e.g., .NET includes Unicode letters in \w, as is legal for a tag name). You could also use


if \w doesn't contain Unicode letters.

But of course you can use more restrictive rules, and possibly not all XML parsers can handle full Unicode tag names. So in the end,


might be your best bet...

Tim Pietzcker
where do you get that it must not start with "xml"? I don't think that is in the spec. See Welbog's answer.
I read that here: http://de.selfhtml.org/xml/dtd/bearbeitungsregeln.htm#namen (sorry, it's in German) where it says that `xml` is reserved for later extensions of the standard.
Tim Pietzcker
AFAIK, `xml`-tags are reserved for, well.. xml-specific tags^^
There is of course the "xml" tag at the beginning of xml docs, but the spec does not place a restriction on those tags that begin with "xml". Again see the grammar that welbog posted.