ansaurus

Question

Regex for reformatting a string to a safe xml-tag

Answer 1

+2 A:

Only lowercase: ^[a-z]+$
First char lowercase, (optional) remaining chars lowercase/numbers: ^[a-z][a-z0-9]*$
Only uppercase: ^[A-Z]+$
First char alphabetic, (optional) remaining chars alphanumeric: ^[a-zA-Z][a-zA-Z0-9]*$

EDIT: To trim off everything but lowercase characters in javascript:

str = str.replace(/[^a-z]/g, "");

The catch is when the user enters nothing but unacceptable characters: you will end up trying to create an XML tag from an empty string. I'd rather ask the user to try again. How hard can it be to enter a lowercase string?

CAUTION: Another edge case is when the user enters xml or any case-insensitive variant thereof (thanks to @Tim's answer). If you are on JavaScript, you cannot use the solution suggested by Tim, as it uses lookbehind, a feature unsupported by JavaScript regexes.

JavaScript code:

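str = str.replace(/\s/g, "_"); // replace each whitespace character with an underscore

str = str.replace(/[^a-zA-Z0-9_\-]/g, ""); // strip every other symbol

var reg = /^(xml|[0-9\-])/i; // reserved "xml" prefix (any case), or a leading digit/hyphen

if (str.length === 0 || reg.test(str)) // is it empty, or does it start with "xml", "XmL", a digit, ...?
    alert("invalid tag name");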
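
Since the asker would rather strip than show an error, the steps above can also be wrapped into one function that always returns something usable. This is only a minimal sketch, not the answerer's code: the name `toSafeTagName` and the `fallback` parameter are assumptions for illustration, and prefixing an underscore when the result starts with a digit, a hyphen or "xml" is just one possible policy.

```javascript
// Hypothetical helper: turns arbitrary user input into an ASCII-only XML tag name.
function toSafeTagName(input, fallback = "item") {
  let name = String(input)
    .trim()
    .replace(/\s+/g, "_")            // a run of whitespace becomes one underscore
    .replace(/[^A-Za-z0-9_-]/g, ""); // drop everything outside the safe set

  if (name.length === 0) {
    return fallback;                 // nothing usable was left
  }
  // A name may not start with a digit or hyphen, and "xml" (any case) is reserved.
  if (/^[0-9-]/.test(name) || /^xml/i.test(name)) {
    name = "_" + name;
  }
  return name;
}

console.log(toSafeTagName("My Tag!"));   // My_Tag
console.log(toSafeTagName("123 go"));    // _123_go
console.log(toSafeTagName("XmlThing"));  // _XmlThing
console.log(toSafeTagName("!!!"));       // item
```

Unlike the snippet above, this version collapses consecutive whitespace into a single underscore; drop the `+` in `/\s+/g` to keep one underscore per whitespace character.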

Amarghosh 2009-11-30 16:15:40

Are single character tags OK? (Your second and last examples will only match tags with multiple characters)

Dexter 2009-11-30 16:18:16

Correct, no doubts - but I don't want to create error messages, so if any invalid characters are found, they'd just be stripped out.

ApoY2k 2009-11-30 16:19:11

@Dexter you spotted it while I was editing it.

Amarghosh 2009-11-30 16:20:00

@ApoY2k What if user enters only invalid characters?

Amarghosh 2009-11-30 16:20:31

Well, okay in that case you could throw an exception. But that's the only case.

ApoY2k 2009-11-30 16:23:31

see my edit - assuming you are on javascript.

Amarghosh 2009-11-30 16:24:53

Answer 2

+6 A:

According to the XML specification, an element's name is formed in the following way:

Name   ::=  NameStartChar (NameChar)*

Where

NameStartChar  ::=  ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] 
  | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] 
  | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] 
  | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar       ::=  NameStartChar | "-" | "." | [0-9] | #xB7 
  | [#x0300-#x036F] | [#x203F-#x2040]

Which is trivial to convert to a regular expression.

If you're looking to remove any character outside of this definition, simply invert the characters the expression is looking for.
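
As a rough sketch of that conversion in JavaScript: the ranges below are copied from the two productions, while the constant and function names are made up for illustration. It assumes an engine that supports the `u` flag (ES2015 or later), which is needed for the `\u{...}` code point escapes.

```javascript
// Character ranges copied from the NameStartChar production.
const NAME_START =
  ":A-Z_a-z\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}\\u{370}-\\u{37D}" +
  "\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}" +
  "\\u{3001}-\\u{D7FF}\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}";
// NameChar adds "-", ".", digits, #xB7 and two further ranges.
const NAME_CHAR = NAME_START + "\\-.0-9\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}";

// Name ::= NameStartChar (NameChar)*
const isXmlName = new RegExp("^[" + NAME_START + "][" + NAME_CHAR + "]*$", "u");
// Inverted classes: anything that is not a NameChar / not a NameStartChar.
const notNameChar = new RegExp("[^" + NAME_CHAR + "]", "gu");
const badStart = new RegExp("^[^" + NAME_START + "]+", "u");

// Hypothetical helper: strip illegal characters, then illegal leading characters.
function stripToXmlName(input) {
  return input.replace(notNameChar, "").replace(badStart, "");
}

console.log(isXmlName.test("a.b-c"));        // true
console.log(isXmlName.test("-a"));           // false
console.log(stripToXmlName("1st <item>!"));  // stitem
```

The result can still be an empty string if nothing legal remains, so the caller has to handle that case.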

Welbog 2009-11-30 16:20:49

Answer 3

+1 A:

XML tags (I assume you're asking about tag names) have to follow these rules:

start with a letter, colon or underscore
contain only letters, digits, dots, hyphens, underscores or colons (colons are meant for namespaces)
must not start with xml

Therefore, a regex for valid tag names could be:

^(?!xml)(?!\d)[\w:][\w.:-]*$

depending on your regex flavor (e.g., .NET includes Unicode letters in \w, as is legal for a tag name). You could also use

^(?!xml)[\p{L}_:][\p{L}\p{N}._:-]*$

if \w doesn't contain Unicode letters.

But of course you can use more restrictive rules, and possibly not all XML parsers can handle full Unicode tag names. So in the end,

^(?!xml)[A-Za-z_:][A-Za-z0-9._:-]*$

might be your best bet...
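
A quick way to try such an ASCII-only pattern in JavaScript is sketched below. The function and constant names are hypothetical. This variant follows the XML `Name` grammar in rejecting a leading dot, hyphen or digit while allowing dots and hyphens later in the name, and the `i` flag makes the `xml` check case-insensitive.

```javascript
// ASCII-only check. (?!xml) is a negative lookahead, which JavaScript regexes support.
const ASCII_TAG_NAME = /^(?!xml)[a-z_:][a-z0-9._:-]*$/i;

// Hypothetical helper: true if `name` is acceptable under the restrictive rules.
function isValidTagName(name) {
  return ASCII_TAG_NAME.test(name);
}

console.log(isValidTagName("item"));       // true
console.log(isValidTagName("ns:item-1"));  // true
console.log(isValidTagName("1item"));      // false
console.log(isValidTagName("XMLdata"));    // false
console.log(isValidTagName(""));           // false
```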

Tim Pietzcker 2009-11-30 16:20:53

Where do you get that it must not start with "xml"? I don't think that is in the spec. See Welbog's answer.

harschware 2009-11-30 16:24:38

I read that here: http://de.selfhtml.org/xml/dtd/bearbeitungsregeln.htm#namen (sorry, it's in German) where it says that `xml` is reserved for later extensions of the standard.

Tim Pietzcker 2009-11-30 16:25:57

AFAIK, `xml`-tags are reserved for, well.. xml-specific tags^^

ApoY2k 2009-11-30 16:27:17

There is of course the "xml" tag at the beginning of xml docs, but the spec does not place a restriction on those tags that begin with "xml". Again see the grammar that welbog posted.

harschware 2009-11-30 16:48:30
