Hi

I want to replace some HTML tags that I have in a CDATA element, but I struggle with getting the syntax in XSLT right. I am getting this error message:

net.sf.saxon.trans.XPathException: Error at character 9 in regular 
expression "<img(\s+(?![^<>]*alt=["\'])[^<...": expected ()) (line 51)

I guess it does not like the <> inside the regex. Does anyone know how to write this in XSLT?

Here is the regEx:

<xsl:variable name="imgTagWithoutAltAttributePattern">
<xsl:text disable-output-escaping="yes">&lt;img(\s+(?![^&lt;&gt;]*alt=["\'])[^&lt;&gt;]+)/&gt;</xsl:text></xsl:variable>

Thanks in advance,

T

+2  A: 

I don't think that the escaped <> brackets are the source of the problem.

Looking at the error message, the error is at character 9, where a closing parenthesis ")" is expected:

<img(\s+(?![^<>]*alt=["\'])[^<...
--------^

As you can see, the "&lt;&gt;" comes out just fine. I suspect that the regex engine does not understand the regex in some other way (maybe the negative look-ahead is the problem?).

I suggest trying a simpler regex first, breaking your original one down into separate tests to single out the problem:

<img\s[^>]+/>                          // test without look-ahead
<img(?=\s)[^>]+/>                      // test with positive look-ahead
<img(?!\S)[^>]+/>                      // test with negative look-ahead
<img((?!\S))[^>]+/>                    // negative look-ahead in parentheses 
<img\s(?![^>]+alt=["'])[^>]+/>         // your intention, expressed differently

This way you could inch your way to the cause of the error.

EDIT

By the OP's own statement, using look-ahead in the regular expression causes the error. The regex dialect of XSLT 2.0 is based on XML Schema regular expressions, which do not support look-ahead.

To match only <img> tags that don't contain an alt attribute, look-around is not strictly required. I propose a different approach:

<img\s(a[^l]|al[^t]|alt\s*[^=]|[^a>])*>           // literal form
&lt;img\s(a[^l]|al[^t]|alt\s*[^=]|[^a&gt;])*&gt;  // XML-encoded form

Credit for this little beast goes to: J.F. Sebastian. Here is the explanation:

<img\s          ....... start of img tag
  (             ....... start of alternatives: either
    a[^l]       ....... "a", not followed by "l"
    |           ....... or
    al[^t]      ....... "al", not followed by "t"
    |           ....... or
    alt\s*[^=]  ....... "alt", optional whitespace, then anything but "="
    |           ....... or
    [^a>]       ....... neither "a" nor ">"
  )*            ....... end of alternatives, repeat as often as possible
>               ....... end of image tag

The standard disclaimer applies: Regex is not the best tool for processing HTML. Use at your own risk.
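
For illustration, the look-ahead-free pattern could be wired into an XSLT 2.0 stylesheet roughly as follows. This is only a sketch: the two sample strings and the named template "main" are made up for the example, and the variable name is taken from the question. Note that disable-output-escaping is not needed, because the XML parser has already turned the entities into plain characters by the time the regex is compiled.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Look-ahead-free pattern from the answer. Inside the stylesheet the
       angle brackets are written as entities; the regex engine sees the
       literal characters. -->
  <xsl:variable name="imgTagWithoutAltAttributePattern"
      select="'&lt;img\s(a[^l]|al[^t]|alt\s*[^=]|[^a&gt;])*&gt;'"/>

  <!-- Hypothetical sample inputs, for illustration only. -->
  <xsl:variable name="samples" select="(
      '&lt;img src=&quot;x.png&quot;/&gt;',
      '&lt;img src=&quot;x.png&quot; alt=&quot;logo&quot;/&gt;')"/>

  <!-- Assumed entry point: run the stylesheet with "main" as the
       initial template. Prints each sample and whether it matches. -->
  <xsl:template name="main">
    <xsl:for-each select="$samples">
      <xsl:value-of select="concat(., ' : ',
          matches(., $imgTagWithoutAltAttributePattern), '&#10;')"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

With these samples, the first tag (no alt attribute) should be reported as matching and the second (with alt) as not matching. The same variable can then be passed to replace() or used in xsl:analyze-string; in the latter, curly braces in the regex attribute would have to be doubled because it is an attribute value template.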

Tomalak
A: 

Hmm! Only the first test went through. Yes, the problem seems to start with the parentheses.

Will look more into it tomorrow. Thanks so far.

T

I don't think XSLT regexes support lookaheads.
Alan Moore
I edited my answer to provide a possible alternative.
Tomalak