I have a simple doc.xml file which contains a single root element with a Timestamp attribute:

<?xml version="1.0" encoding="utf-8"?>
<root Timestamp="04-21-2010 16:00:19.000" />

I'd like to validate this document against my simple schema.xsd to make sure that the Timestamp is in the correct format:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:attribute name="Timestamp" use="required" type="timeStampType"/>
    </xs:complexType>
  </xs:element>
  <xs:simpleType name="timeStampType">
    <xs:restriction base="xs:string">
      <xs:pattern value="(0[0-9]{1})|(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3} ([0-1]{1}[0-9]{1}|2[0-3]{1}):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}" />
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

So I use the lxml Python module and try to perform a simple schema validation and report any errors:

from lxml import etree

schema = etree.XMLSchema( etree.parse("schema.xsd") )
doc = etree.parse("doc.xml")

if not schema.validate(doc):
    for e in schema.error_log:
        print(e.message)

My XML document fails validation with the following error messages:

Element 'root', attribute 'Timestamp': [facet 'pattern'] The value '04-21-2010 16:00:19.000' is not accepted by the pattern '(0[0-9]{1})|(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3} ([0-1]{1}[0-9]{1}|2[0-3]{1}):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}'.
Element 'root', attribute 'Timestamp': '04-21-2010 16:00:19.000' is not a valid value of the atomic type 'timeStampType'.

So it looks like my regular expression must be faulty. But when I try to validate the regular expression at the command line, it passes:

>>> import re
>>> pat = '(0[0-9]{1})|(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3} ([0-1]{1}[0-9]{1}|2[0-3]{1}):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}'
>>> assert re.match(pat, '04-21-2010 16:00:19.000')
>>> 

I'm aware that XSD regular expressions don't have every feature, but the documentation I've found indicates that every feature that I'm using should work.

So what am I misunderstanding, and why does my document fail?

+3  A: 

Your |s match wider than you think.

(0[0-9]{1})|(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3}

is parsed as:

(0[0-9]{1})
    -or-
(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3}

You need to use more groupings if you want to avoid it; e.g.

((0[0-9]{1})|(1[0-2]{1}))-((3[0-1]{1}|[0-2]{1}[0-9]{1}))-[2-9]{1}[0-9]{3} (([0-1]{1}[0-9]{1}|2[0-3]{1})):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}
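The difference can be checked from Python. This is only a sketch: it assumes Python 3.4+ for re.fullmatch, which is used here to approximate the whole-value matching that XSD pattern facets perform, and the variable names are illustrative.

```python
import re

# Original pattern from the question, unchanged.
original = (r'(0[0-9]{1})|(1[0-2]{1})-(3[0-1]{1}|[0-2]{1}[0-9]{1})-[2-9]{1}[0-9]{3} '
            r'([0-1]{1}[0-9]{1}|2[0-3]{1}):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}')
# Regrouped pattern from this answer.
regrouped = (r'((0[0-9]{1})|(1[0-2]{1}))-((3[0-1]{1}|[0-2]{1}[0-9]{1}))-[2-9]{1}[0-9]{3} '
             r'(([0-1]{1}[0-9]{1}|2[0-3]{1})):[0-5]{1}[0-9]{1}:[0-5]{1}[0-9]{1}.[0-9]{3}')

value = '04-21-2010 16:00:19.000'

# re.match only anchors at the start, so the first alternative alone succeeds.
matched_text = re.match(original, value).group(0)

# Requiring the whole value to match exposes the grouping problem.
original_full = re.fullmatch(original, value)
regrouped_full = re.fullmatch(regrouped, value)

print(repr(matched_text))          # '04'
print(original_full is None)       # True
print(regrouped_full is not None)  # True
```

The re.match call in the question succeeded only because the two characters '04' satisfied the first alternative; the rest of the timestamp was never examined.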
Michael Mrozek
That's the reason it's failing to match valid input, but the revised regex can still match a lot of invalid input, as @Daniel points out.
Alan Moore
Well yeah, but that was his question :)
Michael Mrozek
+3  A: 

The expression has several errors.

  1. You allow 00 as a valid month.
  2. A|BC matches A or BC - not AC or BC. Hence your expression starting with (0[0-9]{1})| matches, in an unanchored test such as re.match, any string starting with 00 through 09. What you want is (0[1-9]|1[0-2])- which only matches 01 through 12 followed by a dash.
  3. You allow 00 as a valid day.
  4. The pattern in your Python test is not anchored to the end of the text, whereas an XSD pattern must match the whole value - add ^ and $ to the Python test. That is why your test using Python succeeded.
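The points above can be sketched as a tightened pattern. This is an illustration under assumptions, not a drop-in replacement for the schema: TIMESTAMP and is_valid_timestamp are hypothetical names, the dot before the milliseconds is escaped, and re.fullmatch (Python 3.4+) stands in for whole-value matching.

```python
import re

# Hypothetical tightened pattern: months 01-12, days 01-31, years 2000-9999,
# hours 00-23, and an escaped dot before the milliseconds.
TIMESTAMP = (r'(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])-[2-9][0-9]{3} '
             r'([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\.[0-9]{3}')


def is_valid_timestamp(text):
    """Return True only if the whole string matches the pattern."""
    return re.fullmatch(TIMESTAMP, text) is not None


print(is_valid_timestamp('04-21-2010 16:00:19.000'))  # True
print(is_valid_timestamp('00-21-2010 16:00:19.000'))  # False (month 00)
print(is_valid_timestamp('04-00-2010 16:00:19.000'))  # False (day 00)
print(is_valid_timestamp('04-21-2010 16:00:19x000'))  # False (dot is literal)
```

The pattern still accepts impossible dates such as 02-31, since a regular expression of this kind does not know how many days each month has.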

By the way - why don't you use xs:dateTime? It has a very similar format - yyyy-mm-ddThh:mm:ss.fff I think.
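If changing the data format is an option, a conversion along these lines would produce a value in the xs:dateTime lexical form. This is a minimal sketch: to_xs_datetime is a hypothetical helper, it assumes Python 3.6+ for the timespec argument, and it ignores time zones.

```python
from datetime import datetime


def to_xs_datetime(text):
    """Convert 'MM-DD-YYYY hh:mm:ss.fff' to 'YYYY-MM-DDThh:mm:ss.fff'."""
    # strptime raises ValueError for impossible dates such as month 13.
    parsed = datetime.strptime(text, '%m-%d-%Y %H:%M:%S.%f')
    return parsed.isoformat(timespec='milliseconds')


print(to_xs_datetime('04-21-2010 16:00:19.000'))  # 2010-04-21T16:00:19.000
```

Unlike the regular expression, strptime also rejects dates that do not exist, such as 02-31.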

Daniel Brückner
To elaborate on point #4: the Python regex should have *added* anchors to mimic the behavior of XSD regexes, which are implicitly anchored. (Also, the `.` near the end should be `\.`.) But as you said, with `xs:dateTime` available, this is probably all academic anyway.
Alan Moore
Thanks for the points and suggestions about `xs:dateTime` - as it is, I've been handed a schema and xml doc written by someone else and am just trying to figure out how to make them pass with a minimum of changes, but I'll definitely remember the `xs:dateTime` type for when I next construct a schema myself.
Eli Courtwright