Using the HTML Agility Pack, how can I remove all HTML attributes, elements, etc, etc, from a blob of HTML, with the result as if I pasted it into notepad?

Additionally, I need to remove all formatting but I need to keep UL/LI and B tags.

+4  A: 

Load the HTML into an HtmlDocument instance, get the HtmlNode returned by its DocumentNode property, and read that node's InnerText property. That gives you all of the text with the HTML tags stripped.
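A minimal sketch of that first approach, assuming the HtmlAgilityPack package is referenced. The sample markup and the class name are made up for illustration. Depending on the library version, InnerText may leave entities such as &amp;amp; encoded, so the sketch passes the result through HtmlEntity.DeEntitize.

```csharp
using System;
using HtmlAgilityPack;

class StripAllTags
{
    static void Main()
    {
        // Hypothetical sample input; any HTML fragment works here.
        string html = "<div class=\"x\"><p>Hello <b>world</b> &amp; friends</p></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text of every descendant node.
        string text = doc.DocumentNode.InnerText;

        // Decode character entities so the result reads like plain text.
        Console.WriteLine(HtmlEntity.DeEntitize(text));
    }
}
```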

If you want to keep a particular subset of tags (such as UL, LI and B) and strip everything else, it's going to be a little more difficult.

First, you would load the content into an HtmlDocument instance and get the HtmlNode instance returned by the DocumentNode property (I'll refer to this node from this document as the root node).

At the same time, you would also create a second HtmlDocument instance which would contain the new document you are creating.

On the first document, you would walk the tree recursively from the root node (it doesn't have to be an actual recursive method, but semantically it would be recursive behavior), analyzing each node and all of its child nodes.

If the node itself is one of the nodes you approve of, then you would construct a new instance of that node in the second document.

However, if it is not, you would still process the child nodes of the element, getting the text node content (since text is itself a node) and appending it to whatever node is currently on top of the stack, that is, the most recently created approved node (if there is one).
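The steps above could be sketched roughly as follows. This is only one possible shape for it, not a tested implementation: the class and method names (TagWhitelist, Filter, CopyChildren) are hypothetical, the whitelist is taken from the question, and the call stack of the recursive method plays the role of the "stack" of current nodes. Attributes are dropped because each approved element is recreated by name only.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class TagWhitelist
{
    // Tags to keep (from the question). HAP reports element names in lower case.
    static readonly HashSet<string> Allowed =
        new HashSet<string> { "ul", "li", "b" };

    public static string Filter(string html)
    {
        var source = new HtmlDocument();
        source.LoadHtml(html);

        // Second document that receives only approved elements and text.
        var target = new HtmlDocument();
        CopyChildren(source.DocumentNode, target.DocumentNode, target);
        return target.DocumentNode.OuterHtml;
    }

    static void CopyChildren(HtmlNode from, HtmlNode to, HtmlDocument target)
    {
        foreach (HtmlNode child in from.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Text is always kept and attached to the current output node.
                string text = ((HtmlTextNode)child).Text;
                to.AppendChild(target.CreateTextNode(text));
            }
            else if (child.NodeType == HtmlNodeType.Element)
            {
                HtmlNode destination = to;
                if (Allowed.Contains(child.Name))
                {
                    // Approved tag: recreate it without attributes and descend into it.
                    destination = target.CreateElement(child.Name);
                    to.AppendChild(destination);
                }
                // Unapproved tag: skip the element itself but still visit its children.
                CopyChildren(child, destination, target);
            }
            // Comments and other node types are dropped.
        }
    }
}
```

Calling TagWhitelist.Filter(html) returns the filtered markup as a string. Note that this sketch copies the text of every element, so if the input can contain script or style blocks you would probably want to skip those elements entirely rather than descend into them.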

casperOne
Hi, I expanded my question a bit. Please see if you can comment on this as well.
kaivalya
@kaivalya: Updated the question and my answer.
casperOne