I've tried to understand a few examples, including questions here, so I apologise if this seems to be a duplicate, but I cannot find a regular expression I can understand.
I have some HTML to parse using an XML parser, but I want to strip out the <head> ... </head> section first, as the rest is valid enough for normal XML parsing. The <head> and </head> tags and everything between them must be removed, leaving the outer HTML (the <html> and <body> tags, etc.) unaffected.
This is the section, including the head HTML I want removed, for reference:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<html>
    <head>
    <link rel="stylesheet" type="text/css" href="/style/stylesheet.css" />
    <meta name="description" content="Information" />
    <base target="_top">
</head>
<body>
<!-- Body Here -->
</body>
</html>

I also need to strip the DOCTYPE; if this can be done using a regex then that would be great. The head is always the same. I want to remove everything from <head> to </head> inclusive and nothing else, and if possible remove the DOCTYPE from the text as well.

This also needs to work in Silverlight, using System.Text.RegularExpressions or similar.
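
For reference, the removal described above could look roughly like the following. This is only a sketch under the assumptions stated in the question (one DOCTYPE, one head element that is always the same), not a general HTML cleaner. The names HeadStripper and Strip are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadStripper
{
    // Matches a DOCTYPE declaration such as <!DOCTYPE HTML PUBLIC "..." >.
    // Assumes the declaration contains no '>' before its closing one.
    private static readonly Regex DocType =
        new Regex(@"<!DOCTYPE[^>]*>", RegexOptions.IgnoreCase);

    // Matches <head> through </head> inclusive.
    // Singleline lets '.' match line breaks, since the head spans several lines.
    private static readonly Regex Head =
        new Regex(@"<head\b[^>]*>.*?</head\s*>",
                  RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Strip(string html)
    {
        string withoutDocType = DocType.Replace(html, string.Empty);
        return Head.Replace(withoutDocType, string.Empty).Trim();
    }
}
```

Calling HeadStripper.Strip(html) on the sample above should leave only the <html> and <body> elements, which can then be handed to the XML parser.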

+1  A: 

HTML Agility Pack

Using regexes on HTML is a sin...

Austin Salonen
I'm sure this is ideal normally, but it is way more than I need. I just need to remove one tag and its contents: as long as the head tags and everything between them are removed, that's all I need.
RoguePlanetoid
Unless performance is critical, I would still use HTML Agility Pack as it's far more robust. You will also find that trying to parse HTML as XML is more problematic than you might think (e.g. character entities).
Dan Diplo
+1  A: 

You can use string.Substring + string.IndexOf to extract the body XML element.

The code should be something like this:

sHtml.Substring(sHtml.IndexOf("<body>"), sHtml.IndexOf("</body>") - sHtml.IndexOf("<body>") + 7);
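
A slightly more defensive variant of the same Substring + IndexOf idea might look like the sketch below. It assumes a bare <body> tag with no attributes, as in the question's sample, and returns null instead of throwing when either tag is missing. BodyExtractor and ExtractBody are illustrative names only.

```csharp
using System;

public static class BodyExtractor
{
    // Returns the <body>...</body> element (tags included), or null when
    // either tag cannot be found. Assumes "<body>" carries no attributes.
    public static string ExtractBody(string html)
    {
        const string open = "<body>";
        const string close = "</body>";

        int start = html.IndexOf(open, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;

        // Search for the closing tag only after the opening tag.
        int end = html.IndexOf(close, start + open.Length,
                               StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return html.Substring(start, end - start + close.Length);
    }
}
```

Using close.Length rather than a literal 7 keeps the length calculation tied to the tag being searched for.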
Extracting the body from the rest may be the right way to go, thanks!
RoguePlanetoid
+1  A: 

Extracting the body was easier; here is the regex I am using:

@"\<body\>(.*?)\</body\>"

Now I can parse that normally with LINQ-to-XML!
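
As a rough sketch of how the pieces might fit together: because the body spans several lines, the pattern needs RegexOptions.Singleline so that '.' also matches line breaks. The example below assumes a bare lowercase <body> tag and a body that is well-formed XML; BodyParser and ParseBody are hypothetical names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class BodyParser
{
    // Singleline makes '.' match newlines, so a multi-line body is captured.
    private static readonly Regex Body =
        new Regex(@"<body>.*?</body>", RegexOptions.Singleline);

    // Returns the body as an XElement, or null when no <body>...</body> pair
    // is found. XElement.Parse still throws XmlException if the matched
    // text is not well-formed XML.
    public static XElement ParseBody(string html)
    {
        Match match = Body.Match(html);
        return match.Success ? XElement.Parse(match.Value) : null;
    }
}
```

The null return covers input where the closing tag is absent, so the caller can decide how to handle that case.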

RoguePlanetoid
+1 easy and simple
Teddy
Unless you're controlling the HTML and ensuring it is well-formed, `</body>` is not guaranteed to exist.
Austin Salonen
The HTML is always the same in this case, but it's a good point that this element may not be present in all cases.
RoguePlanetoid