I'm wondering if this is possible.

I have html like so:

<p>
  <font face="Georgia">
    <b>History</b><br>&nbsp; <br>Two of the polysaccharides used in the manufacture of...</font>
    <a title="PubMed" href="http://www.www.gov/pubmed/" target="_blank">
    <font face="Georgia">) and this web site for new development by...well as Self Affirmed Medical Food GRAS status.&nbsp; 
    </font>
</p>

<p>
  <font face="Georgia">[READMORE]</font>
</p>

<p><font face="Georgia"><br><strong>Proprietary Composition</strong><br>
   <br>The method in which soluble fibres are made into... REST OF ARTICLE...
</p>

Yes, it's ugly html and it comes from a WYSIWYG so I have little control over it.

What I want to do is search for [READMORE] in the document, remove any parent tags (in this case, the <font> and the <p> tags), and replace them with a readmore link, while wrapping the REST of the document in a giant `<div class='wrapper'>...rest of article...</div>`.

I'm pretty sure the HtmlAgilityPack will get me part of the way there, but I'm just trying to figure out where to start.

So far, I'm pretty sure that I have to use htmlDoc.DocumentNode.SelectSingleNode("//p[text()='[READMORE]']") or something. I'm not too familiar with XPath.

For my documents, the readmore may or may not be in a nested font tag.

Also, in some cases, it may not be in a tag at all, but rather at the document root. I can just do a regular search and replace in that case and it should be straightforward.

My ideal situation would be something like this (PSEUDOCODE)

var node = SelectNodeContaining("[READMORE]");

node.Replace("link here");

node.RestOfDocument().Wrap("<div class='wrapper'>");

I know, I'm dreaming... but I hope this makes sense.

A: 

If I understand correctly, you can try the same approach we use when sending custom HTML mails:

  1. Create a template of your html page with static contents.
  2. Add identifiers for the dynamic content, such as the [ReadMore] you mentioned, {ReadMore}, or something similar.
  3. Now read the template html file line by line and replace the identifiers with desired text.
  4. Now save the entire string to a new html file or do whatever you want.
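
As a rough illustration of steps 3 and 4, here is a minimal Python sketch. The asker is on .NET, so treat it as a language-neutral outline that happens to run. `fill_template`, the sample template, and the `#more` link target are made-up names for illustration. It replaces on the whole string rather than line by line, which gives the same result as long as an identifier never spans a line break.

```python
def fill_template(template, values):
    """Replace every [IDENTIFIER] placeholder in the template with its value."""
    result = template
    for identifier, replacement in values.items():
        result = result.replace("[" + identifier + "]", replacement)
    return result


# Hypothetical template with static content and one placeholder.
template = "<p>Intro text</p>\n<p>[READMORE]</p>\n<p>Rest of article</p>"
page = fill_template(template, {"READMORE": '<a href="#more">Read more</a>'})
print(page)
```

Note that this only swaps text. It does nothing about the tags surrounding the placeholder.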
Amit Ranjan
That's the plan. However, if I replace [Readmore] with a link and encapsulate the rest of the article from that point forward in a div tag, I will have unclosed tags. I need to remove the parents of [readmore] (if they exist) and then do it. I'm stuck on a consistent way to remove them.
Atømix
A: 

Here is an XSLT solution:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="p[descendant::text()[. = '[READMORE]']]">
  <a href="#ReadmoreWrapper">READMORE</a>
  <div class="wrapper" id="ReadmoreWrapper">
   <xsl:apply-templates select="following-sibling::node()" mode="copy"/>
  </div>
 </xsl:template>

 <xsl:template match=
  "node()[ancestor::p[descendant::text()[. = '[READMORE]']]
         or
          preceding::p[descendant::text()[. = '[READMORE]']]
          ]
  "/>

  <xsl:template match="node()|@*" mode="copy">
      <xsl:copy>
       <xsl:apply-templates select="node()|@*" mode="copy"/>
      </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

when this transformation is applied on the following XML document:

<html>
<p>
  <font face="Georgia">
    <b>History</b><br/>&#xA0; <br/>Two of the polysaccharides used in the manufacture of...</font>
    <a title="PubMed" href="http://www.www.gov/pubmed/" target="_blank"/>
    <font face="Georgia">) and this web site for new development by...well as Self Affirmed Medical Food GRAS status.&#xA0;
    </font>
</p>

<p>
  <font face="Georgia">[READMORE]</font>
</p>

<p><font face="Georgia"><br/><strong>Proprietary Composition</strong><br/>
   <br/>The method in which soluble fibres are made into... REST OF ARTICLE...
   </font>
</p>

</html>

the wanted result is produced:

<html>
    <p>
        <font face="Georgia"><b>History</b><br/>  <br/>Two of the polysaccharides used in the manufacture of...</font>
        <a title="PubMed" href="http://www.www.gov/pubmed/" target="_blank"/>
        <font face="Georgia">) and this web site for new development by...well as Self Affirmed Medical Food GRAS status. 
    </font>
    </p>
    <a href="#ReadmoreWrapper">READMORE</a>
    <div class="wrapper" id="ReadmoreWrapper">
        <p>
            <font face="Georgia"><br/><strong>Proprietary Composition</strong><br/><br/>The method in which soluble fibres are made into... REST OF ARTICLE...
   </font>
        </p>
    </div>
</html>
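
For readers who prefer a procedural version, the same idea can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is not HtmlAgilityPack code and it assumes the input is already well-formed XML, like the document above. `wrap_after_marker` is a made-up helper name. The sketch finds the element whose text is the marker, climbs to its top-level ancestor, replaces that ancestor with a link, and moves every following sibling into a wrapper `div`.

```python
import xml.etree.ElementTree as ET

MARKER = "[READMORE]"


def wrap_after_marker(root):
    """Replace the top-level element holding MARKER with a link and wrap all
    following siblings in <div class="wrapper">. Returns False if no element
    below the root has MARKER as its text."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    holder = next((el for el in root.iter()
                   if el is not root and (el.text or "").strip() == MARKER), None)
    if holder is None:
        return False
    # Climb from the marker's element (e.g. <font>) to the direct child of root (e.g. <p>).
    top = holder
    while parent_of[top] is not root:
        top = parent_of[top]
    children = list(root)
    index = children.index(top)
    link = ET.Element("a", {"href": "#ReadmoreWrapper"})
    link.text = "READMORE"
    # The id carries no '#', so the href fragment above can resolve to it.
    div = ET.Element("div", {"class": "wrapper", "id": "ReadmoreWrapper"})
    for sibling in children[index + 1:]:
        root.remove(sibling)
        div.append(sibling)
    root.remove(top)
    root.insert(index, link)
    root.insert(index + 1, div)
    return True


doc = ET.fromstring(
    "<html><p>Intro</p>"
    '<p><font face="Georgia">[READMORE]</font></p>'
    "<p>Rest of article</p></html>"
)
wrap_after_marker(doc)
print(ET.tostring(doc, encoding="unicode"))
```

A marker sitting directly in the root's text is deliberately not handled. The question says a plain search and replace is enough for that case.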
Dimitre Novatchev
It looks like it would work, but I'm getting parsing errors. It doesn't like `&nbsp;` in the text when parsing as an XML document. Can the same XSLT transform be done on a `HtmlAgilityPack.HtmlDocument`?
Atømix
I thought that HtmlAgilityPack produces an XML document. If this isn't true, you could convert its HTML DOM to an XML tree (DOM) programmatically. When I wrote the transformation, I replaced all `&nbsp;` with `&#xA0;`, replaced all unclosed tags like `<br>` with `<br />`, and added some missing `</font>` end tags. Most probably the HtmlAgilityPack authors provide a serializer to XML.
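
A rough sketch of that kind of manual clean-up, in Python and under strong assumptions: `to_xml_friendly` is a made-up helper, it only handles `&nbsp;` and a few void tags (`br`, `hr`, `img`), and it does not add missing end tags such as `</font>`. A real HTML parser is the safer route for anything messier.

```python
import re
import xml.etree.ElementTree as ET

# Matches <br>, <br />, <hr>, <img ...> and similar, whether or not they are self-closed.
VOID_TAG = re.compile(r"<(br|hr|img)\b([^<>]*?)\s*/?>", re.IGNORECASE)


def to_xml_friendly(html):
    """Make a small HTML fragment easier to parse as XML."""
    # &nbsp; is not defined in plain XML; the numeric reference is.
    html = html.replace("&nbsp;", "&#xA0;")
    # Rewrite void tags in self-closing form.
    return VOID_TAG.sub(r"<\1\2/>", html)


fragment = "<p><b>History</b><br>&nbsp; <br>Two of the...</p>"
cleaned = to_xml_friendly(fragment)
root = ET.fromstring(cleaned)  # raises ParseError if the result is still not well-formed
print(cleaned)
```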
Dimitre Novatchev