I've simplified the problem somewhat, but I hope I've still captured the essence of my problem.

Let's say I have the following simple XML file:

<main>
  outside1
  ===BEGIN===
    inside1
  ====END====
  outside2
  =BEGIN=
    inside2
  ==END==
  outside3
</main>

Then I can use the following XSLT 2.0 stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

<xsl:template match="text()">

  <xsl:analyze-string select="." regex="=+BEGIN=+">
     <xsl:matching-substring>
        <section/>
     </xsl:matching-substring>
     <xsl:non-matching-substring>
          <xsl:analyze-string select="." regex="=+END=+">  
             <xsl:matching-substring>
                <_section/>
             </xsl:matching-substring>
             <xsl:non-matching-substring>
                <xsl:value-of select="."/>
             </xsl:non-matching-substring>
          </xsl:analyze-string>
     </xsl:non-matching-substring>
  </xsl:analyze-string>

</xsl:template>

</xsl:stylesheet>

To transform it to the following:

<?xml version="1.0" encoding="UTF-8"?>
  outside1
  <section/>
    inside1
  <_section/>
  outside2
  <section/>
    inside2
  <_section/>
  outside3

Here are the questions:

Multiple regexes

Is there a better way to match two different regexes rather than nesting them inside another like what was done above?

  • What if they're not easily nestable like this?
  • Can I have XSL templates to match and transform regex matches in a text()?
    • In this case, I'd have two templates, one for each regex
    • If possible, this would be the ideal solution

Opening and closing elements on regex matches

Obviously, instead of:

<section/>
   inside
<_section/>

What I really want eventually is:

<section>
   inside
</section>

So how would you do this? I'm not sure it's even possible to open an element in one regex match and close it in another (what if there is no match for the closer? The result would not be well-formed XML!), but this task seems typical enough that there must be an idiomatic solution for it.

Note: we can assume that sections will not overlap, and thus also will not nest. We can also assume that they will always appear in proper pairs.


Additional info

So essentially I'm trying to accomplish what in Perl would succinctly be something like:

s/=+BEGIN=+/<section>/
s/=+END=+/<\/section>/

I'm looking for a way to do this in XSLT instead, because:

  • It'd be more robust with regard to the context of the regex match
    • (i.e. it should only transform text() nodes)
  • It'd also be more robust with regard to matching various XML entities
+1  A: 

This transformation:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 exclude-result-prefixes="xs"
>
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
    <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="text()">
   <xsl:analyze-string select="." flags="mx"
    regex="=+BEGIN=+((.|\n)*?)=+END=+">

   <xsl:matching-substring>
    <section><xsl:value-of select="regex-group(1)"/></section>
   </xsl:matching-substring>

   <xsl:non-matching-substring>
    <xsl:value-of select="."/>
   </xsl:non-matching-substring>
 </xsl:analyze-string>
 </xsl:template>
</xsl:stylesheet>

when applied on the provided XML document:

<main>
  outside1
  ===BEGIN===
    inside1
  ====END====
  outside2
  =BEGIN=
    inside2
  ==END==
  outside3
</main>

produces the wanted result:

<main>
  outside1
  <section>
    inside1
  </section>
  outside2
  <section>
    inside2
  </section>
  outside3
</main>
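
On the first question (avoiding nested `xsl:analyze-string` calls), one possible approach, which is not part of the answer above, is to combine the patterns into a single alternation and test which capture group matched. A minimal sketch, assuming the same `<section/>` and `<_section/>` marker elements as in the question:

```xml
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

 <xsl:template match="text()">
   <!-- One pass: group 1 is the opener, group 2 is the closer -->
   <xsl:analyze-string select="." regex="(=+BEGIN=+)|(=+END=+)">
    <xsl:matching-substring>
     <xsl:choose>
      <!-- regex-group() returns '' for a group that did not take part in the match -->
      <xsl:when test="regex-group(1) != ''">
       <section/>
      </xsl:when>
      <xsl:otherwise>
       <_section/>
      </xsl:otherwise>
     </xsl:choose>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
     <xsl:value-of select="."/>
    </xsl:non-matching-substring>
   </xsl:analyze-string>
 </xsl:template>
</xsl:stylesheet>
```

This only produces the empty marker elements from the question; wrapping the content in a real `<section>` element still needs the whole-section match shown in the answer.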
Dimitre Novatchev
Yep, I was afraid that I had to do this (i.e. match the whole section). Now what if I have another regex transformation that I want to apply to both the inside and the outside text? What's the best way to do it? Name the template and call it in both the `matching` and `non-matching` branches?
polygenelubricants
@polygenelubricants: Not a named template but an `<xsl:function>`; this is very convenient.
Dimitre Novatchev
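
The `<xsl:function>` suggestion in the last comment could look roughly like the sketch below. The thread never says what the second regex transformation is, so the function name `my:inline`, the `urn:example:my` namespace, and its rule (wrapping `*word*` in `<em>`) are all hypothetical placeholders. The section regex uses `flags="s"` with `.*?`, which should behave like the `(.|\n)*?` form in the answer.

```xml
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:my="urn:example:my"
 exclude-result-prefixes="xs my">
 <xsl:output omit-xml-declaration="yes"/>

 <!-- Hypothetical helper: stands in for any second regex pass.
      Here it turns *word* into <em>word</em>. -->
 <xsl:function name="my:inline">
  <xsl:param name="text" as="xs:string"/>
  <xsl:analyze-string select="$text" regex="\*([^*]+)\*">
   <xsl:matching-substring>
    <em><xsl:value-of select="regex-group(1)"/></em>
   </xsl:matching-substring>
   <xsl:non-matching-substring>
    <xsl:value-of select="."/>
   </xsl:non-matching-substring>
  </xsl:analyze-string>
 </xsl:function>

 <!-- Identity template, as in the answer -->
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="text()">
  <xsl:analyze-string select="." flags="s"
   regex="=+BEGIN=+(.*?)=+END=+">
   <!-- Inside a section: wrap it, then apply the shared pass -->
   <xsl:matching-substring>
    <section><xsl:sequence select="my:inline(regex-group(1))"/></section>
   </xsl:matching-substring>
   <!-- Outside a section: apply the same shared pass -->
   <xsl:non-matching-substring>
    <xsl:sequence select="my:inline(.)"/>
   </xsl:non-matching-substring>
  </xsl:analyze-string>
 </xsl:template>
</xsl:stylesheet>
```

Because the function returns a sequence of nodes, it can be called from any XPath expression, so the same pass is reused in both branches without duplicating the inner `xsl:analyze-string`.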