I'm currently trying to write a RegexReplace to ensure an input can be used as a valid XML-tag, meaning: no spaces, no special characters, only lowercase, etc...

Is there a common approach to that or do I have to do it all from scratch?

Example:

string Invalid = "asd(%4 asKUd n!%mn &§a_As1";  // Invalid as a tag

string Valid = FormatToSafeXmlTag(Invalid);  // How to write this function?

// Valid = "asd4_askud_nmna_as1"
+2  A: 
  • Only lowercase: ^[a-z]+$
  • First char lowercase, (optional) remaining chars lowercase/numbers: ^[a-z][a-z0-9]*$
  • Only uppercase: ^[A-Z]+$
  • First char a letter, (optional) remaining chars alphanumeric: ^[a-zA-Z][a-zA-Z0-9]*$

EDIT: To strip everything but lowercase characters in JavaScript:

str = str.replace(/[^a-z]/g, "");

The catch is when the user enters nothing but unacceptable characters: you will end up trying to create an XML tag from an empty string. I'd rather ask the user to try again. How hard can it be to enter a lowercase string?

CAUTION: Another edge case is when the user enters xml or any case-insensitive variant thereof (thanks to @Tim's answer). If you are on JavaScript, note that Tim's pattern uses a negative lookahead, (?!xml), which JavaScript regexes do support (it is lookbehind that older engines lack); as written, though, it only rejects the lowercase prefix, so the check below uses the i flag instead.

JavaScript code:

str = str.replace(/\s/g, "_"); // replace each whitespace character with an underscore

str = str.replace(/[^a-zA-Z0-9_\-]/g, ""); // strip every remaining symbol

var reg = /^xml/i; // matches a leading "xml", "XmL", ...

if (str.length === 0 || reg.test(str)) // empty, or starts with "xml" in any letter case
    alert("invalid tag name");
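
Putting the steps above together, here is a minimal sketch of the function the question asks for. The name formatToSafeXmlTag is borrowed from the question. Lowercasing first, throwing on an empty result (as discussed in the comments), and prepending an underscore to repair a leading digit or an xml prefix are all assumptions of this sketch, not something the answer prescribes.

```javascript
// Hypothetical helper, named after the question's FormatToSafeXmlTag.
function formatToSafeXmlTag(input) {
  let tag = input
    .toLowerCase()                // the question wants lowercase only
    .replace(/\s+/g, "_")         // each run of whitespace becomes one underscore
    .replace(/[^a-z0-9_]/g, "");  // drop every other character

  if (tag.length === 0) {
    throw new Error("input contains no characters usable in a tag name");
  }
  // Assumption: repair a leading digit or an "xml" prefix by prepending "_".
  if (/^[0-9]/.test(tag) || tag.startsWith("xml")) {
    tag = "_" + tag;
  }
  return tag;
}

console.log(formatToSafeXmlTag("asd(%4 asKUd n!%mn &§a_As1")); // asd4_askud_nmn_a_as1
```

For the question's input this yields asd4_askud_nmn_a_as1. That is one underscore more than the expected string in the question, because every run of whitespace is turned into an underscore here.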
Amarghosh
Are single character tags OK? (Your second and last examples will only match tags with multiple characters)
Dexter
Correct, no doubts - but I don't want to create error messages, so if any invalid characters are found, they'd just be stripped out.
ApoY2k
@Dexter you spotted it while I was editing it.
Amarghosh
@ApoY2k What if user enters only invalid characters?
Amarghosh
Well, okay in that case you could throw an exception. But that's the only case.
ApoY2k
see my edit - assuming you are on javascript.
Amarghosh
+6  A: 

According to the XML specification, an element's name is formed in the following way:

Name   ::=  NameStartChar (NameChar)*

Where

NameStartChar  ::=  ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] 
  | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] 
  | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] 
  | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar       ::=  NameStartChar | "-" | "." | [0-9] | #xB7 
  | [#x0300-#x036F] | [#x203F-#x2040]

Which is trivial to convert to a regular expression.

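If you're looking to remove any character outside of this definition, negate the NameChar character class (a leading ^ inside the brackets) and replace every match with an empty string; the first character still has to be checked against NameStartChar separately.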
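
As a rough sketch of that conversion in JavaScript, the two productions can be written as character-class bodies and combined. The constant names are made up for this example, and the u flag is assumed to be available (it is needed for \u{...} escapes above #xFFFF).

```javascript
// Character-class bodies transcribed from the NameStartChar / NameChar productions.
const NAME_START = String.raw`:A-Z_a-z\u{C0}-\u{D6}\u{D8}-\u{F6}\u{F8}-\u{2FF}` +
  String.raw`\u{370}-\u{37D}\u{37F}-\u{1FFF}\u{200C}-\u{200D}\u{2070}-\u{218F}` +
  String.raw`\u{2C00}-\u{2FEF}\u{3001}-\u{D7FF}\u{F900}-\u{FDCF}\u{FDF0}-\u{FFFD}` +
  String.raw`\u{10000}-\u{EFFFF}`;
const NAME_CHAR = NAME_START + String.raw`\-.0-9\u{B7}\u{300}-\u{36F}\u{203F}-\u{2040}`;

// Name ::= NameStartChar (NameChar)*
const XML_NAME = new RegExp(`^[${NAME_START}][${NAME_CHAR}]*$`, "u");
// Inverted class: matches any character that may not appear in a name at all.
const NOT_NAME_CHAR = new RegExp(`[^${NAME_CHAR}]`, "gu");

console.log(XML_NAME.test("a-b.c"));                      // true
console.log(XML_NAME.test("4abc"));                       // false
console.log("asd(%4 as".replace(NOT_NAME_CHAR, ""));      // asd4as
```

Stripping with NOT_NAME_CHAR alone does not guarantee a valid name, since the result may still begin with a digit, a dot or a hyphen, so it is worth running XML_NAME.test on the result afterwards.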

Welbog
+1  A: 

XML tags (I assume you're asking about tag names) have to follow these rules:

  • start with a letter, colon or underscore (not a digit, dot or hyphen)
  • only contain letters, digits, dot, hyphen, underscore or colon (for namespaces)
  • must not start with xml

Therefore, a regex for valid tag names could be:

^(?!xml|\d)[\w:][\w.:-]*$

depending on your regex flavor (e.g., .NET includes Unicode letters in \w, as is legal for a tag name). You could also use

^(?!xml)[\p{L}_:][\p{L}\p{N}._:-]*$

if \w doesn't contain Unicode letters.

But of course you can use more restrictive rules, and possibly not all XML parsers can handle full Unicode tag names. So in the end,

^(?!xml)[A-Za-z_:][A-Za-z0-9._:-]*$

might be your best bet...
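
For illustration, here is how such patterns might be used from JavaScript. This is a sketch with made-up names (ASCII_TAG, UNICODE_TAG, isValidTagName). It follows the specification's grammar (no leading dot or digit, hyphen allowed after the first character) and, as an extra assumption, rejects the xml prefix in any letter case. Note that \p{L} and \p{N} only approximate the ranges listed in the specification.

```javascript
// Hypothetical names; the patterns are restricted to commonly used characters.
const ASCII_TAG = /^(?![Xx][Mm][Ll])[A-Za-z_:][A-Za-z0-9._:\-]*$/;
// \p{L} and \p{N} require the "u" flag in JavaScript (ES2018 and later).
const UNICODE_TAG = /^(?![Xx][Mm][Ll])[\p{L}_:][\p{L}\p{N}._:\-]*$/u;

function isValidTagName(name, allowUnicode = false) {
  return (allowUnicode ? UNICODE_TAG : ASCII_TAG).test(name);
}

console.log(isValidTagName("asd4_askud"));   // true
console.log(isValidTagName("XmlData"));      // false: reserved prefix
console.log(isValidTagName(".hidden"));      // false: a name cannot start with a dot
console.log(isValidTagName("größe", true));  // true once Unicode letters are allowed
```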

Tim Pietzcker
where do you get that it must not start with "xml"? I don't think that is in the spec. See Welbog's answer.
harschware
I read that here: http://de.selfhtml.org/xml/dtd/bearbeitungsregeln.htm#namen (sorry, it's in German) where it says that `xml` is reserved for later extensions of the standard.
Tim Pietzcker
AFAIK, `xml`-tags are reserved for, well.. xml-specific tags^^
ApoY2k
There is of course the "xml" tag at the beginning of xml docs, but the spec does not place a restriction on those tags that begin with "xml". Again see the grammar that welbog posted.
harschware