Has anyone done this? Basically, I want to sanitize the HTML: keep basic tags such as h1, h2, em, etc.; strip every non-HTTP address from the img and a tags; and HTML-encode every other tag.

I'm stuck on the HTML-encoding part. I know that to remove a node you call "node.ParentNode.RemoveChild(node);", where node is an HtmlNode object. Instead of removing the node, though, I want to HTML-encode it.

+1  A: 

You would need to remove the node representing the element you don't want. The encoded HTML would then need to be re-added as a text node.

If you don't want to process the children of the elements that you want to throw away, you should be able to just use OuterHtml ... something like this might work:

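var encoded = nodeToDelete.OwnerDocument.CreateTextNode(HttpUtility.HtmlEncode(nodeToDelete.OuterHtml)); // HtmlTextNode has no public constructor, so ask the document for one
nodeToDelete.ParentNode.ReplaceChild(encoded, nodeToDelete); // swap the element for its encoded markup, in place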
A: 

The answer above pretty much covers it. There's one thing to add, though.

You don't want to change one particular node but all of them, so the code above will probably end up in a method, wrapped in an if statement (to make sure it's a tag you want to HtmlEncode). More to the point, since Agility Pack doesn't expose nodes by ordinal, you can't just loop over the entire document by index. Recursion is the easiest way to go about it. You probably already know this...
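For what it's worth, here is a rough sketch of that recursive walk. It is not taken from the linked code; the class name HtmlWhitelist, the Clean and Walk methods, and the contents of the Allowed set are all made up for illustration. It assumes the Html Agility Pack calls HtmlDocument.CreateTextNode and HtmlNode.ReplaceChild, and it leaves the img/a address cleanup out entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using HtmlAgilityPack;

public static class HtmlWhitelist
{
    // Hypothetical whitelist of tags to keep as real markup; adjust to taste.
    private static readonly HashSet<string> Allowed = new HashSet<string>(
        new[] { "h1", "h2", "em", "p", "a", "img" },
        StringComparer.OrdinalIgnoreCase);

    // Parses the fragment, encodes every non-whitelisted element,
    // and returns the resulting markup.
    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        Walk(doc.DocumentNode);
        return doc.DocumentNode.OuterHtml;
    }

    private static void Walk(HtmlNode node)
    {
        // Iterate over a copy: replacing children while enumerating
        // ChildNodes itself would modify the collection mid-loop.
        foreach (HtmlNode child in node.ChildNodes.ToList())
        {
            if (child.NodeType != HtmlNodeType.Element)
                continue; // text and comment nodes are left as they are

            if (Allowed.Contains(child.Name))
            {
                Walk(child); // tag is kept, so check its descendants
            }
            else
            {
                // Swap the element, children included, for a text node
                // holding its HTML-encoded markup.
                HtmlNode text = node.OwnerDocument.CreateTextNode(
                    HttpUtility.HtmlEncode(child.OuterHtml));
                node.ReplaceChild(text, child);
            }
        }
    }
}
```

Calling HtmlWhitelist.Clean("<h1>Hi</h1><div>x</div>") should keep the h1 as markup and turn the div into visible, encoded text. Note that a rejected element is encoded together with everything inside it, so whitelisted tags nested in a rejected one get encoded too; if you'd rather keep those, you would have to encode only the start and end tags and recurse into the children instead.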

I tackled a similar problem, and have some shell code (C#) you're more than welcome to use: http://dev.forrestcroce.com/normalizer-of-web-pages-qualifier-of-urls/2008-12-09/