While writing a stylesheet to convert old LoC transcriptions of books, which used a very outdated SGML DTD for formatting, I've run into a roadblock in the following situation:

In the converted XML files, there are some lines of text like the following:

<p> Text on left <hsep></hsep> Text on right </p>

hsep essentially pushes the remaining text to be right-justified. Unfortunately, I don't know of any way to convert this to HTML by just converting tags, as HTML has nothing like hsep short of dubious CSS hacks. I think it would be more useful to be able to convert this to something like:

<p> Text on left <span class="right">Text on right</span> </p>

However, I'm not sure how to do this. Within the <p> element I would need to determine whether there is an <hsep> and, if so, create a tag surrounding the remaining text, while still applying templates to any elements that might be there. I don't think cases where I have something like

<p> Text a <em> Text b <hsep></hsep> Text c </em> </p>

are common or even present, so I don't think that will pose a problem, but there may be situations like:

<p> <em> Text a Text b <hsep></hsep> Text c </em> </p>

I can think of complicated, horrible ways of doing this involving regexes, but I'm hoping there's a non-horrible way.

+1  A: 

This transformation:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*" name="identity">
     <xsl:copy>
       <xsl:apply-templates select="node()|@*"/>
     </xsl:copy>
 </xsl:template>

 <xsl:template match="hsep">
  <span class="right">
   <xsl:apply-templates mode="copy"
        select="following-sibling::node()"/>
  </span>
 </xsl:template>

 <xsl:template match="node()[preceding-sibling::hsep]"/>

 <xsl:template mode="copy"
  match="node()[preceding-sibling::hsep]">

  <xsl:call-template name="identity"/>
 </xsl:template>
</xsl:stylesheet>

when applied on this document:

<html>
  <p> Text a <em> Text b <hsep></hsep> Text c </em> </p>
  <p> <em> Text a Text b <hsep></hsep> Text c </em> </p>
</html>

produces the wanted, correct result:

<html>
   <p> Text a <em> Text b <span class="right"> Text c </span></em></p>
   <p><em> Text a Text b <span class="right"> Text c </span></em></p>
</html>
Dimitre Novatchev
+1  A: 

create a tag surrounding the remaining text based on it being there, while also applying templates to any elements that might be there

I think that for better forward processing you could use this stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="node()|@*" name="identity">
        <xsl:copy>
            <xsl:apply-templates select="node()[1]|@*"/>
        </xsl:copy>
        <xsl:apply-templates select="following-sibling::node()[1]"/>
    </xsl:template>
    <xsl:template match="hsep">
        <span class="right">
            <xsl:apply-templates select="following-sibling::node()[1]"/>
        </span>
    </xsl:template>
</xsl:stylesheet>

With Dimitre's input:

<html>
  <p> Text a <em> Text b <hsep></hsep> Text c </em> </p>
  <p> <em> Text a Text b <hsep></hsep> Text c </em> </p>
</html>

Output:

<html>
<p> Text a <em> Text b <span class="right"> Text c </span></em></p>
<p><em> Text a Text b <span class="right"> Text c </span></em></p>
</html>

Note: Without modes, you can declare a rule once for elements, whether they precede or follow hsep.

Alejandro
Hi, Alejandro. Can you explain how your method works, and why it has the "better forward processing" characteristics? I ran your XSLT against Dimitre's and yours came out between 3X and 4X faster. If there's already an explanation online, can you point me there? I do not know how to search for this pattern... does it have a name like, "Identity Transform"? Thank you.
Zachary Young
@Zachary Young: This pattern is called the "most fine-grained traversal". It is a fundamental pattern, like the identity transformation, and it navigates the tree node by node in document order. "Better forward processing" means that if you have sibling elements of `hsep` to transform (for example, `i` to `span class="italic"`), you need to declare only one rule for them. The mode-based approach, by contrast, needs a rule per mode: one without a mode to catch the preceding siblings and one with the mode for the following siblings. About performance, the only reason I can think of is that you have a lot of nodes following `hsep` that get processed twice (once in each mode).
Alejandro
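
To illustrate the "one rule" point from the comment above, here is a minimal sketch that extends the second stylesheet with the `i` to `span class="italic"` rule mentioned there. The sample input in the leading comment is hypothetical, and the shape of the `i` rule is an assumption about how such a rule would be written in this pattern: since every template is responsible for moving on to the next sibling, the new rule has to continue the chain itself.

```xml
<!-- Hypothetical sample input (not from the original question):
     <p>Left <i>a</i> <hsep></hsep> Right <i>b</i></p> -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- Fine-grained identity: copy the node, descend into its attributes
         and first child only, then step to the next sibling. -->
    <xsl:template match="node()|@*" name="identity">
        <xsl:copy>
            <xsl:apply-templates select="node()[1]|@*"/>
        </xsl:copy>
        <xsl:apply-templates select="following-sibling::node()[1]"/>
    </xsl:template>

    <!-- hsep: the rest of the sibling chain is emitted inside the span,
         so there is no sibling step after the closing tag. -->
    <xsl:template match="hsep">
        <span class="right">
            <xsl:apply-templates select="following-sibling::node()[1]"/>
        </span>
    </xsl:template>

    <!-- Assumed extra rule: a single template covers i elements on either
         side of hsep. Attributes of i are dropped in this sketch. -->
    <xsl:template match="i">
        <span class="italic">
            <xsl:apply-templates select="node()[1]"/>
        </span>
        <xsl:apply-templates select="following-sibling::node()[1]"/>
    </xsl:template>
</xsl:stylesheet>
```

For the sample input, both `i` elements should end up as `span class="italic"`, with the second one nested inside the `span class="right"` wrapper. In the mode-based stylesheet the equivalent rule would presumably have to be declared twice, once without a mode and once with `mode="copy"`.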