Why does the following code return true (Saxon-EE 9.2 for .NET)?

matches('some text>', '^[\w ]{3,200}$')

There is no > symbol in the pattern. Thanks.

XQuery:

<regexp-test>
    <!-- why true? -->
    <test1>{matches('some text>', '^[\w ]{3,200}$')}</test1>
    <test2>{matches('some text>', '^[\w ]+$')}</test2>
    <test3>{matches('&lt; < >', '^[\w ]+$')}</test3>
    <!-- valid: --> 
    <test4>{matches('some text!', '^[\w ]+$')}</test4>  
    <test5>{matches('.,', '^[\w ]+$')}</test5> 
</regexp-test>

Output:

<regexp-test>
  <!-- why true? -->
  <test1>true</test1>
  <test2>true</test2>
  <test3>true</test3>
  <!-- valid: -->
  <test4>false</test4>
  <test5>false</test5>
</regexp-test>
A: 

I'll have a go...

I will guess that you meant to write

matches( 'some text' , '^[\w ]{3,200}$' )

The regex says to start at the beginning of the string (^), match at least 3 and at most 200 ({3,200}) word characters or spaces ([\w ]), and then expect the end of the string ($).

So 'some text' returns true, since it consists only of the allowed characters ([a-zA-Z0-9_ ]) and there are 9 of them.

If the pattern is changed to this, for example,

matches( 'some text' , '^[\w ]{3,5}$' )

the result should be false, since that pattern only matches strings of length 3 to 5.

Perhaps you think \w means whitespace or something else?

philcolbourn
I think the asker is wondering about the `>` that is present in the text but not in the pattern.
AakashM
Thank you, I know regular expressions and the basics you have described. But it was not a misprint: I meant a string containing ">" (or another symbol that is not a word character: >, <, !, =, etc.). The pattern does not contain these characters, yet the string "some text>" still matches. The same pattern works correctly in the .NET and Java regular expression implementations, but something is wrong in Saxon's.
chardex
A: 

> is not a valid character in a string in this situation and needs to be replaced by its representation &gt;. I guess it is being silently dropped and therefore the regex matches.

See also w3schools.com: "XQuery is case-sensitive and XQuery elements, attributes, and variables must be valid XML names." - and > is not allowed inside XML attributes.

Tim Pietzcker
I got the same result when > is replaced with &gt;. matches('< < >', '^[\w ]+$') returns true.
chardex
What happens with `matches('>', '^[\w ]+$')`?
Tim Pietzcker
It returns true.
chardex
Weird. Must be a bug in Saxon-EE or something.
Tim Pietzcker
A: 

After some digging, experimentation and help from others in the eXist community, I found that the Unicode character classes used to define regexps in XPath and XML Schema differ from the POSIX classes. In particular, the characters

$+<=>^|~

are in the Symbol class \p{S}, not the Punctuation class \p{P}. Since the definition of \w (from http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/datatypes-with-errata.html) is

"[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters)"

these characters are included in \w.
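
One way to check this classification is to test single characters against the category escapes directly. This is only a minimal sketch, assuming a processor whose regex engine follows the XML Schema definitions quoted above; the element names are made up for illustration.

```xquery
<category-test>
    <!-- '>' is in the Symbol category, so it is not Punctuation and \w accepts it -->
    <gt-symbol>{matches('>', '^\p{S}$')}</gt-symbol>
    <gt-punct>{matches('>', '^\p{P}$')}</gt-punct>
    <gt-word>{matches('>', '^\w$')}</gt-word>
    <!-- '!' is in the Punctuation category, so \w rejects it -->
    <bang-punct>{matches('!', '^\p{P}$')}</bang-punct>
    <bang-word>{matches('!', '^\w$')}</bang-word>
</category-test>
```

On such a processor I would expect true, false, true for the three `>` tests and true, false for the two `!` tests, which is consistent with the output in the question.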

This leads to a workaround: use [^\W\p{S}] in place of \w, which matches any \w character that is not a symbol.
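
Applied to the pattern from the question, the workaround might look like the sketch below. Because [^\W\p{S}] is a negated class, the space cannot simply be added inside the brackets, so here it is given its own alternative; the grouping and the element names are my own choices, not something taken from the Saxon documentation.

```xquery
<workaround-test>
    <!-- word characters that are not symbols, or a space, 3 to 200 times -->
    <test1>{matches('some text>', '^([^\W\p{S}]| ){3,200}$')}</test1>
    <test2>{matches('some text', '^([^\W\p{S}]| ){3,200}$')}</test2>
</workaround-test>
```

With the symbols excluded, test1 should now be false because of the trailing `>`, while test2 should still be true.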

Chris Wallace