I need to postprocess some HTML that has bad structure -- e.g.

<html>
<body>...</body>
<body>...</body>
</html>

What's the best way to transform this HTML so that the contents of the second body appear inside the first, without the extra body tags themselves? I don't want this rule to change anything else.

I've thought of matching on the html tag and handling it from there using explicit apply-templates calls, but it seems a little sloppy to me. I know how to match the spurious bodies ("body[position() > 1]") but I'd like some ideas on how best to write the transform.

Edit: I do need to apply other templates to children of all of these elements, so a simple copy won't work.

And I would like to preserve comments and processing instructions. I want pretty much the entire document as an identity transform, except for these multiple bodies and some other minor edits, which I am already doing successfully.

Edit 2: It is important to keep the children of the second body element in the above example. They should be children of the first body tag in the output, at the end of the child nodes of the first body tag.

Edit 3: Here is some illustrative input/output (not checked for validity):

<html>
  <!-- Look at my comments -->
  <head>
    <title>My title!</title>
    <!-- Commentary -->
  </head>
  <body>
     <p>Something <b>bold</b></p>
  </body>
  <body>
     <!-- heh -->
     <p>Some bozo put my parent in here.</p>
  </body>
  <body>
     <p>More stuff here</p>
  </body>
</html>

needs to be:

<html>
  <!-- Look at my comments -->
  <head>
    <title>My title!</title>
    <!-- Commentary -->
  </head>
  <body>
     <p>Something <b>bold</b></p>
     <!-- heh -->
     <p>Some bozo put my parent in here.</p>
     <p>More stuff here</p>
  </body>
</html>
+1  A: 

Working around upstream problems with tailored downstream hacks usually results in a codebase that is hard to maintain.

You should preferably repair the broken HTML at its source instead. Having several body tags sounds like a severe misunderstanding somewhere.

Jukka Dahlbom
Unfortunately that's not really possible in this case.
Steven Huwig
+1  A: 

If your input HTML is well-formed XML, then this XSLT template would do it:

<xsl:template match="/">
  <body>
    <xsl:copy-of select="//body/node()" />
  </body>
</xsl:template>

(I left out the <html> node in this example, since handling it is trivial.)

A more flexible variant of the above (as per the OP's request):

<!-- explicitly catching the initial html circumvents built-in templates -->
<xsl:template match="/html">
  <xsl:copy>
    <xsl:apply-templates />
  </xsl:copy>
</xsl:template>

<!-- copy everything that is not processed otherwise -->
<xsl:template match="@*|node()|processing-instruction()">
  <xsl:copy-of select="." />
</xsl:template>

<!-- matches any "body" node, but produces output only for the first -->
<xsl:template match="body">
  <xsl:if test="not(preceding-sibling::body)">
    <xsl:copy>
      <xsl:apply-templates select="//body/@*|//body/node()" />
    </xsl:copy>
  </xsl:if>
</xsl:template>

<!-- you can add more of these specific templates, as needed -->
<xsl:template match="body//a">
  <b>
    <xsl:copy-of select="." />
  </b>
</xsl:template>

This input:

<html>
  <head><title>Foo!</title></head>
  <?dummy processing instruction?>
  <body foo="bar">...<a href="foo">asd</a><!-- comment --></body>
  <body>...contents of body#2...</body>
</html>

gets me this result (white-space and indentation changed for readability):

<html>
  <head><title>Foo!</title></head>
  <?dummy processing instruction?>
  <body foo="bar">
    ...
    <b><a href="foo">asd</a></b>
    <!-- comment -->
    ...contents of body#2...
  </body>
</html>
Tomalak
I should have told you that I do need to apply other templates to children of body.
Steven Huwig
Well, use <xsl:apply-templates select="//body//node()" />, then.
Tomalak
And head, and comments and processing instructions. The trouble is that doing much in the parent HTML context seems to commit me to explicitly naming/applying things I don't want to name/apply, hence "messy."
Steven Huwig
A: 

If the HTML is that messed up, I would be loath to assume that it is well-formed enough to use XSLT. You might want to just use regular expressions to find the

<body>(whitespace)</body>

and remove it.

Keltex
The HTML is being processed by TagSoup and is thus well-formed XML.
Steven Huwig
+1  A: 

Perhaps this is closer to what you were after:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            version="2.0" exclude-result-prefixes="xsl">
<xsl:output indent="yes" method="html"/>

<xsl:template match="/">
 <xsl:apply-templates select="@*|node()"/>
</xsl:template>

<!-- Identity Template -->
<xsl:template match="@*|node()">
 <xsl:copy>
  <xsl:apply-templates select="@*|node()"/>
 </xsl:copy>
</xsl:template>

<!-- Matches on the first 'body' tag -->
<xsl:template match="body[1]">
 <xsl:copy>
  <!-- apply-templates on the children of all the body tags -->
  <xsl:apply-templates select="//body/node()"/>
 </xsl:copy>
</xsl:template>

<!-- Skip processing on the subsequent body tags 
     (their children are still processed however)   -->
<xsl:template match="body"/>

</xsl:stylesheet>

This uses the popular 'push' structure for the templates, so you might find it more flexible.

Jweede
That's close to what I have, but I want to keep the children of the other body tags, just merge them into the end of the first. I believe this stylesheet removes them.
Steven Huwig
Note that using "//body" finds all "body" elements in the document, not just the ones that are children of the "html" element.
Robert Rossney
@Steven although it looks like that's what it's doing, take a closer look. Since we're applying templates, it does copy the contents of each body element.
Jweede
@Robert Thanks for pointing that out. To be more specific the XPath could be '/html/body/node()'
Jweede
+1  A: 

I think @Keltex meant that you should strip out

</body>\s*<body>

before processing the document, so that you can write the XSLT as you would for a normalized input.

That's what I'd do.

(This assumes that the multiple body tags have no content between them.)

EDIT: This would not remove the content of the body elements. You'd be removing only the span from a closing body tag to the next opening one, which leaves the initial opening tag and the final closing tag in place. In other words, with input like this

<body>
    good stuff
</body>
<body>
    more good stuff
</body>

you'd be targeting those two tags in the middle. Removing these would yield a single, continuous body:

<body>
    good stuff
    more good stuff
</body>
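The answer does not name a language for this preprocessing step, so here is a minimal sketch in Python, assuming the markup is available as a string before it reaches the XSLT stage. The names `BODY_SEAM` and `merge_adjacent_bodies` are made up for illustration. The pattern also tolerates attributes on the later opening tag, which goes slightly beyond the plain `</body>\s*<body>` expression above.

```python
import re

# A closing body tag, optional whitespace, then the next opening body tag
# (with or without attributes). Case-insensitive, since tag-soup input may
# use <BODY>. Hypothetical name, not part of any library.
BODY_SEAM = re.compile(r"</body\s*>\s*<body\b[^>]*>", re.IGNORECASE)


def merge_adjacent_bodies(markup: str) -> str:
    """Remove every '</body> <body>' seam so consecutive bodies become one."""
    return BODY_SEAM.sub("", markup)


sample = (
    "<html><body><p>good stuff</p></body>\n"
    '<body class="x"><!-- heh --><p>more good stuff</p></body></html>'
)
merged = merge_adjacent_bodies(sample)
print(merged)
```

Two limitations of this sketch: attributes on the second and later body tags are discarded along with the tags, and a `>` inside an attribute value would confuse the `[^>]*` part of the pattern. Anything other than whitespace between two bodies leaves that seam untouched, which matches the assumption stated in the answer.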
harpo
But I want the contents of those extra body tags. They show up in browsers.
Steven Huwig
Oh, I misread what you were saying. That's an idea.
Steven Huwig
Yeah that was what I was saying.
Keltex
+2  A: 

Add these templates to the identity transform:

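<!-- explicit priority: this pattern and "/html/body" below both default to 0.5; on a conflict, a processor may signal an error or pick the last matching template, which is the empty one -->
<xsl:template match="/html/body[1]" priority="1">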
   <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
      <xsl:apply-templates select="/html/body[2]/node() | /html/body[2]/@*"/>
   </xsl:copy>
</xsl:template>

<xsl:template match="/html/body"/>

Edit:

To be belt-and-suspenders about it, instead of body[2] in the above you could use body[position() != 1]. That would handle the case where your input had more than two body elements.

Robert Rossney
Steven Huwig
Mrs. Rossney didn't raise any stupid children. Actually that's not true, but it's not me.
Robert Rossney