ansaurus

Question

Answer 1

+1 A:

Avoiding problems downstream by writing tailored hacks usually results in a codebase that is hard to maintain.

You should preferably repair the broken HTML at its source instead; having several body tags sounds like a severe misunderstanding somewhere.

Jukka Dahlbom 2009-03-23 14:21:51

Unfortunately that's not really possible in this case.

Steven Huwig 2009-03-23 17:10:47

Answer 2

+1 A:

If your input HTML is well-formed XML, then this XSLT template would do it:

<xsl:template match="/">
  <body>
    <xsl:copy-of select="//body/node()" />
  </body>
</xsl:template>

(I left out the <html> node in this example, since adding it is trivial.)

A more flexible variant of the above (as per the OP's request):

<!-- explicitly catching the initial html circumvents built-in templates -->
<xsl:template match="/html">
  <xsl:copy>
    <xsl:apply-templates />
  </xsl:copy>
</xsl:template>

<!-- copy everything that is not processed otherwise -->
<xsl:template match="@*|node()|processing-instruction()">
  <xsl:copy-of select="." />
</xsl:template>

<!-- matches any "body" node, but produces output only for the first -->
<xsl:template match="body">
  <xsl:if test="not(preceding-sibling::body)">
    <xsl:copy>
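      <!-- attributes first: they cannot be added once child nodes are written -->
      <xsl:apply-templates select="//body/@*" />
      <xsl:apply-templates select="//body/node()" />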
    </xsl:copy>
  </xsl:if>
</xsl:template>

<!-- you can add more of these specific templates, as needed -->
<xsl:template match="body//a">
  <b>
    <xsl:copy-of select="." />
  </b>
</xsl:template>

This input:

<html>
  <head><title>Foo!</title></head>
  <?dummy processing instruction?>
  <body foo="bar">...<a href="foo">asd</a><!-- comment --></body>
  <body>...contents of body#2...</body>
</html>

gets me this result (white-space and indentation changed for readability):

<html>
  <head><title>Foo!</title></head>
  <?dummy processing instruction?>
  <body foo="bar">
    ...
    <b><a href="foo">asd</a></b>
    <!-- comment -->
    ...contents of body#2...
  </body>
</html>

Tomalak 2009-03-23 14:31:21

I should have told you that I do need to apply other templates to children of body.

Steven Huwig 2009-03-23 15:40:21

Well, use <xsl:apply-templates select="//body//node()" />, then.

Tomalak 2009-03-23 16:52:42

And head, and comments and processing instructions. The trouble is that doing much in the parent HTML context seems to commit me to explicitly naming/applying things I don't want to name/apply, hence "messy."

Steven Huwig 2009-03-23 17:17:23

Answer 3

A:

If the HTML is that messed up, then I would be loath to assume that it is well-formed enough to use XSLT. You might want to just use regular expressions to find the

<body>(whitespace)</body>

and remove it.

Keltex 2009-03-23 15:56:26

The HTML is being processed by TagSoup and is thus well-formed XML.

Steven Huwig 2009-03-23 17:12:37

Answer 4

+1 A:

Perhaps this is closer to what you were after:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            version="2.0" exclude-result-prefixes="xsl">
<xsl:output indent="yes" method="html"/>

<xsl:template match="/">
 <xsl:apply-templates select="@*|node()"/>
</xsl:template>

<!-- Identity Template -->
<xsl:template match="@*|node()">
 <xsl:copy>
  <xsl:apply-templates select="@*|node()"/>
 </xsl:copy>
</xsl:template>

<!-- Matches on the first 'body' tag -->
<xsl:template match="body[1]">
 <xsl:copy>
  <!-- apply-templates to the children of all the body tags -->
  <xsl:apply-templates select="//body/node()"/>
 </xsl:copy>
</xsl:template>

<!-- Skip processing on the subsequent body tags 
     (their children are still processed however)   -->
<xsl:template match="body"/>

</xsl:stylesheet>

This uses the popular 'push' structure for the templates, so you might find it more flexible.

Jweede 2009-03-23 16:02:21

That's close to what I have, but I want to keep the children of the other body tags, just merge them into the end of the first. I believe this stylesheet removes them.

Steven Huwig 2009-03-23 17:28:04

Note that using "//body" finds all "body" elements in the document, not just the ones that are children of the "html" element.

Robert Rossney 2009-03-23 18:17:49

@Steven although it looks like that's what it's doing, take a closer look. Since we're applying templates, it does copy the contents of each body element.

Jweede 2009-03-23 19:29:26

@Robert Thanks for pointing that out. To be more specific, the XPath could be '/html/body/node()'.
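
The difference is easy to see with a quick Python sketch. ElementTree only supports a subset of XPath, so './/body' (relative to the root element) stands in for '//body' here, and 'body' stands in for '/html/body'. The nested body in the sample is contrived purely to show the effect; it is an assumption, not something taken from the original input.

```python
import xml.etree.ElementTree as ET

# Contrived sample: two top-level bodies plus one nested inside a div.
doc = ET.fromstring(
    "<html>"
    "<body><div><body>nested</body></div></body>"
    "<body>second</body>"
    "</html>"
)

# Searches every descendant of <html>, like "//body" in the stylesheet.
all_bodies = doc.findall(".//body")

# Looks only at direct children of <html>, like "/html/body".
top_level_bodies = doc.findall("body")

print(len(all_bodies), len(top_level_bodies))  # prints: 3 2
```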

Jweede 2009-03-23 19:30:16

Answer 5

+1 A:

I think @Keltex meant that you should strip out

</body>\s*<body>

before processing the document, so that you can write the XSLT as you would for a normalized input.

That's what I'd do.

(This assumes that the multiple body tags have no content between them.)

EDIT: This would not remove the content of the body tags. Note that you'd be removing anything from a closing body tag to an opening one. This would leave in place the initial and final tags. In other words with input like this

<body>
    good stuff
</body>
<body>
    more good stuff
</body>

you'd be targeting those two tags in the middle. Removing these would yield a single, continuous body:

<body>
    good stuff
    more good stuff
</body>
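
A minimal Python sketch of that preprocessing step, assuming the markup is available as a string before it reaches the XSLT processor. The function name merge_adjacent_bodies and the sample input are made up for illustration. The pattern is widened slightly beyond the one above so that an opening tag with attributes also matches; those attributes are simply discarded, which may or may not be acceptable for your input.

```python
import re

# A closing body tag, optional whitespace, then an opening body tag
# (with or without attributes). Case is ignored because raw HTML often
# mixes <BODY> and <body>.
BODY_SEAM = re.compile(r"</body\s*>\s*<body\b[^>]*>", re.IGNORECASE)


def merge_adjacent_bodies(html: str) -> str:
    """Remove every </body>...<body> seam so the bodies read as one."""
    return BODY_SEAM.sub("", html)


sample = (
    "<html><body>\n"
    "    good stuff\n"
    "</body>\n"
    "<body>\n"
    "    more good stuff\n"
    "</body></html>"
)

merged = merge_adjacent_bodies(sample)
print(merged)
```

Because this works on raw text, it would also hit a matching sequence inside a comment or CDATA section, so it is best treated as a quick fix for input you know.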

harpo 2009-03-23 17:21:56

But I want the contents of those extra body tags. They show up in browsers.

Steven Huwig 2009-03-23 17:29:36

Oh, I misread what you were saying. That's an idea.

Steven Huwig 2009-03-23 18:03:59

Yeah, that was what I was saying.

Keltex 2009-03-23 19:37:01

Answer 6

+2 A:

Add these templates to the identity transform:

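<!-- priority="1" is needed: both patterns default to priority 0.5, so without
     it the empty template below could win for the first body as well -->
<xsl:template match="/html/body[1]" priority="1">
   <xsl:copy>
      <!-- attributes must be written before any child nodes; on a name clash
           the later one wins, so the first body's own attributes go last -->
      <xsl:apply-templates select="/html/body[2]/@*"/>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="node()"/>
      <xsl:apply-templates select="/html/body[2]/node()"/>
   </xsl:copy>
</xsl:template>

<xsl:template match="/html/body"/>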

Edit:

To be belt-and-suspenders about it, instead of body[2] in the above you could use body[position() != 1]. That would handle the case where your input had more than two body elements.
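
For checking what the merged result should look like, the same merge can be sketched procedurally with Python's standard library. This is a stand-in for the XSLT, not a replacement. The function name merge_bodies and the sample document are invented for illustration, and ElementTree's default parser discards comments and processing instructions, which the stylesheet would keep.

```python
import xml.etree.ElementTree as ET


def merge_bodies(html):
    """Fold every later child <body> of html into the first, in document order."""
    bodies = html.findall("body")  # direct children only
    if not bodies:
        return html
    first, extras = bodies[0], bodies[1:]
    for extra in extras:
        # Keep the first body's value when an attribute name repeats.
        for name, value in extra.attrib.items():
            first.attrib.setdefault(name, value)
        # Leading text of the extra body goes after whatever is already there.
        if extra.text:
            if len(first):
                first[-1].tail = (first[-1].tail or "") + extra.text
            else:
                first.text = (first.text or "") + extra.text
        first.extend(list(extra))  # move the child elements across
        html.remove(extra)         # drop the now redundant body
    return html


doc = ET.fromstring(
    '<html><head><title>Foo!</title></head>'
    '<body foo="bar">one<a href="foo">asd</a></body>'
    '<body>two</body><body><p>three</p></body></html>'
)
merge_bodies(doc)
print(ET.tostring(doc, encoding="unicode"))
```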

Robert Rossney 2009-03-23 18:15:28

Steven Huwig 2009-03-23 23:45:51

Mrs. Rossney didn't raise any stupid children. Actually that's not true, but it's not me.

Robert Rossney 2009-03-24 03:32:08

XSLT: help me fix multiple BODY tags
