ansaurus

Question

How to remove named HTML Tags and Contents from a String?

Answer 1

+1 A:

Using regexes on HTML is a sin...

Austin Salonen 2010-05-17 15:45:36

I'm sure this is ideal normally, but it is far more than I need. I only need to remove one tag and its contents: as long as the head tags and everything between them are removed, that's all I need.

RoguePlanetoid 2010-05-17 15:50:32

Unless performance is critical, I would still use HTML Agility Pack, as it's far more robust. You will also find that trying to parse HTML as XML is more problematic than you might think (e.g. character entities).

Dan Diplo 2010-05-17 16:18:58
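
For the narrow case described in the comments above (dropping the head element and everything inside it), a regex-based removal might look roughly like the sketch below. It is only an illustration, not something posted in the thread: the `HeadRemover` class and `RemoveHead` method are hypothetical names, and it assumes the input contains at most one simple, non-nested head element. For arbitrary HTML, a real parser remains the more robust choice, as noted above.

```csharp
using System.Text.RegularExpressions;

public static class HeadRemover
{
    // Matches from "<head" (with optional attributes) through the first "</head>".
    // \b keeps the pattern from matching other tags such as <header>.
    // Singleline lets '.' span line breaks; IgnoreCase tolerates <HEAD>.
    private static readonly Regex HeadPattern = new Regex(
        @"<head\b[^>]*>.*?</head\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the input with the head element removed.
    // Input without a head element is returned unchanged.
    public static string RemoveHead(string html)
    {
        return HeadPattern.Replace(html, string.Empty);
    }
}
```

Calling `HeadRemover.RemoveHead(html)` leaves the rest of the markup untouched, so the result is only as well-formed as the input was.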

Answer 2

+1 A:

You can use string.Substring + string.IndexOf to extract the body XML element.

The code should be something like this:

sHtml.Substring(sHtml.IndexOf("<body>"), sHtml.IndexOf("</body>") - sHtml.IndexOf("<body>") + 7);
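
Note that `IndexOf` returns -1 when a tag is missing, which would make the `Substring` call throw. A slightly more defensive version might look like the following sketch. The `BodyExtractor` class and `ExtractBody` method are hypothetical names, and the code assumes a bare `<body>` tag with no attributes.

```csharp
using System;

public static class BodyExtractor
{
    // Hypothetical helper: returns the "<body>...</body>" span of sHtml,
    // or null when either tag cannot be found.
    public static string ExtractBody(string sHtml)
    {
        const string open = "<body>";
        const string close = "</body>";

        int start = sHtml.IndexOf(open, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;

        // Look for the closing tag only after the opening tag.
        int end = sHtml.IndexOf(close, start + open.Length, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        // Include the closing tag itself in the result.
        return sHtml.Substring(start, end - start + close.Length);
    }
}
```

Returning null instead of throwing lets the caller decide how to handle a document without a body element.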

2010-05-17 16:14:20

Extracting the body from the rest may be the right way to go, thanks!

RoguePlanetoid 2010-05-17 16:23:05

Answer 3

+1 A:

Extracting the Body was easier - here is the RegEx I am using:

@"\<body\>(.*?)\</body\>"

Now I can parse that normally with LINQ-to-XML!
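
As a rough sketch of how that pattern could feed LINQ-to-XML: the `BodyParser` class and `ParseBody` method below are hypothetical names, and the code assumes the body markup is well-formed XML, since `XElement.Parse` throws an `XmlException` otherwise. `RegexOptions.Singleline` is added so that `.` also matches line breaks, which a multi-line body needs.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class BodyParser
{
    // Same idea as the pattern above; Singleline lets '.' match newlines.
    private static readonly Regex BodyPattern = new Regex(
        @"<body>(.*?)</body>",
        RegexOptions.Singleline);

    // Hypothetical helper: returns the body as an XElement,
    // or null when the pattern does not match.
    public static XElement ParseBody(string html)
    {
        Match match = BodyPattern.Match(html);
        if (!match.Success) return null;

        // match.Value includes the body tags, so the parser sees one root element.
        return XElement.Parse(match.Value);
    }
}
```

The null return covers input where the body tags are absent, so the caller can check the result before querying it.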

RoguePlanetoid 2010-05-17 16:30:57

+1 easy and simple

Teddy 2010-05-17 16:35:55

Unless you're controlling the HTML and ensuring it is well-formed, `</body>` is not guaranteed to exist.

Austin Salonen 2010-05-17 18:02:47

The HTML is always the same in this case, but it is a good point that this element may not be present in all cases.

RoguePlanetoid 2010-05-17 18:53:37
