I need a regex or a function in PHP that will validate a string to be a good XML element name.

From w3schools:

XML elements must follow these naming rules:

  1. Names can contain letters, numbers, and other characters
  2. Names cannot start with a number or punctuation character
  3. Names cannot start with the letters xml (or XML, or Xml, etc)
  4. Names cannot contain spaces

I can write a basic regex that checks rules 1, 2, and 4, but it won't account for all of the allowed punctuation and won't handle rule 3:

\w[\w0-9-]
A: 
if ((substr(strtolower($text), 0, 3) != 'xml') && (1 === preg_match('/^\w[^<>]+$/', $text)))
{
    // valid;
}
Coronatus
A: 

This should give you roughly what you need [Assuming you are using Unicode]:
(Note: This is completely untested.)

[^\p{P}xX0-9][^mMlL\s]{2}[\w\p{P}0-9-]

\p{P} is the syntax for Unicode Punctuation marks in PHP's regular expression syntax.

Sean Vieira
Among other problems, that won't match anything that starts with 'x' or has 'm' or 'l' as the second or third characters. That disallows a lot more than just "xml".
Alan Moore
@Alan; very valid point. Could you use negative look-aheads instead? (More for curiosity than anything else. Gordon's way is far better than what I posted off-hand.)
Sean Vieira
That's right. @Mef's answer has its own problems, but it demonstrates how to use a lookahead for that part of the job.
Alan Moore
+2  A: 

How about

/\A(?!XML)[a-z][\w0-9-]*/i

Usage:

if (preg_match('/\A(?!XML)[a-z][\w0-9-]*/i', $subject)) {
    # valid name
} else {
    # invalid name
}

Explanation:

\A  Beginning of the string
(?!XML)  Negative lookahead (assert that it is impossible to match "XML")
[a-z]  Match a non-digit, non-punctuation character
[\w0-9-]*  Match an arbitrary number of allowed characters
/i  make the whole thing case-insensitive
Mef
This doesn’t match <äøñ> which is a valid Nmtoken as of XML 1.1. See http://www.w3.org/TR/xml11/#sec-common-syn
toscho
Hmm... never dealt with Unicode in regexes. Any suggestions?
Mef
This expression with some mods for unicode plus filter_var() should do the job. Thanks.
xsaero00
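
For the Unicode question raised in the comments above, here is a minimal sketch of one possible approach. It transcribes the Name production from the XML 1.0 (fifth edition) specification into a PCRE character class and uses the /u modifier so that the pattern and subject are treated as UTF-8. The function name isXmlNameUnicode is hypothetical, and rejecting the "xml" prefix follows rule 3 from the question rather than the grammar itself. Note that the grammar also allows a colon, which a namespace-aware check would want to restrict.

```php
<?php
// Hypothetical helper, not taken from any answer in this thread.
function isXmlNameUnicode(string $name): bool
{
    // NameStartChar ranges from the XML 1.0 (5th ed.) Name production.
    $start = 'A-Z_a-z:\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{2FF}'
           . '\x{370}-\x{37D}\x{37F}-\x{1FFF}\x{200C}-\x{200D}'
           . '\x{2070}-\x{218F}\x{2C00}-\x{2FEF}\x{3001}-\x{D7FF}'
           . '\x{F900}-\x{FDCF}\x{FDF0}-\x{FFFD}\x{10000}-\x{EFFFF}';

    // NameChar adds "-", ".", digits, and a few combining/extender ranges.
    $rest = $start . '\-.0-9\x{B7}\x{300}-\x{36F}\x{203F}-\x{2040}';

    // \A and \z anchor to the whole string; (?!xml) with /i rejects the
    // reserved prefix in any letter case; /u enables UTF-8 handling.
    $pattern = '/\A(?!xml)[' . $start . '][' . $rest . ']*\z/iu';

    return preg_match($pattern, $name) === 1;
}
```

With this sketch, a name such as "äøñ" is accepted, while "1abc", "XmlFoo", and "aaa bbb" are rejected.
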
+5  A: 

If you want to create valid XML, use the DOM extension. That way you don't have to bother with any regex. If you try to pass an invalid name to a DOMElement, you'll get an error.

function isValidXmlName($name)
{
    try {
        new DOMElement($name);
        return TRUE;
    } catch(DOMException $e) {
        return FALSE;
    }
}

Note that this won't throw an Exception when $name is or starts with xml. Add

if(stripos($name, 'xml') === 0) return false;

before the try/catch block if you want to exclude this too.

Gordon
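
Putting the two pieces of this answer together might look like the sketch below. It assumes PHP's DOM extension (ext-dom) is loaded. The function name isValidXmlElementName is hypothetical, chosen to avoid clashing with the function above.

```php
<?php
// Sketch only: combines the prefix check with the DOMElement try/catch.
function isValidXmlElementName(string $name): bool
{
    // Rule 3 from the question: reject a leading "xml" in any letter case.
    if (stripos($name, 'xml') === 0) {
        return false;
    }

    try {
        // The DOMElement constructor throws DOMException for an invalid name.
        new DOMElement($name);
        return true;
    } catch (DOMException $e) {
        return false;
    }
}
```

Under these assumptions, "note" passes, while "1note", "XMLnote", and "aaa bbb" are rejected.
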
This introduces a lot of overhead for just checking an element name. I use DOM objects when I am ready to do actual XML processing.
xsaero00
@xsaero00 well, first of all: we usually don't downvote all answers we didn't accept. All of the answers given contain valid approaches to your problem. Second, I have benchmarked my solution (incl. strpos) versus the accepted solution and incidentally my solution is 250% faster. If you don't believe it, do a benchmark yourself.
Gordon
A: 

Use this regex:

^_?(?!(xml|[_\d\W]))([\w.-]+)$

This satisfies all four of your rules and allows Unicode characters.

Timo Zimmermann
A: 

Inspired by Mef's nice answer, but with an ending '$' (otherwise names containing spaces, like 'aaa bbb', will be accepted):

$validXmlName = (preg_match('/^(?!XML)[a-z][\w0-9-]*$/i', $subject) != 0);
Frosty Z
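
To make the effect of the end anchor concrete, here is a small comparison sketch. The helper matches() and the third pattern are assumptions added for illustration. In PCRE, `$` without the D modifier also matches just before a trailing newline, so `\z` is shown as a stricter alternative.

```php
<?php
// Three anchoring choices applied to the same ASCII-only pattern.
$unanchored = '/\A(?!XML)[a-z][\w0-9-]*/i';   // Mef's pattern: start anchor only
$dollar     = '/^(?!XML)[a-z][\w0-9-]*$/i';   // Frosty Z's pattern: adds $
$strict     = '/\A(?!XML)[a-z][\w0-9-]*\z/i'; // assumption: \z instead of $

// Hypothetical helper that turns preg_match()'s result into a boolean.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

var_dump(matches($unanchored, 'aaa bbb')); // bool(true): only the "aaa" prefix is checked
var_dump(matches($dollar, 'aaa bbb'));     // bool(false)
var_dump(matches($dollar, "aaa\n"));       // bool(true): $ tolerates a final newline
var_dump(matches($strict, "aaa\n"));       // bool(false)
```
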