ansaurus

Question

Regex to remove body tag attributes (C#)

Answer 1

A:

LittleBobbyTables' comment above is correct!

Regex is not the right tool. If you read the answer LittleBobbyTables linked to, you'll see it's actually true: it clearly shows what the answerer went through as a result of using the wrong tool for the job, and using regex for this kind of thing will put you under the same undue strain and stress.

Regex is NOT duct tape for jobs like this, nor is it the answer to everything (including 42)... use the right tool for the right job.

Instead, you should check out HtmlAgilityPack, which will do the job for you and ultimately save you from the stress, tears and blood of coming to grips with parsing HTML via regex...

tommieb75 2010-09-28 23:54:21

Can you give an example of HtmlAgilityPack accomplishing what I want?

Brandon 2010-09-29 00:15:33

tommieb75 2010-09-29 00:32:26

I've already read it, thank you very much... and don't find it helpful for my scenario.

Brandon 2010-09-29 00:34:10

I've also read http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack but don't see how to modify this to suit my requirements...

Brandon 2010-09-29 00:39:23

Just because you know a lot doesn't mean you have to be so presumptuous about others 'not bothering'.

Brandon 2010-09-29 00:40:14

@tommieb75: are you serious? if i said it *wasn't* a self promo, would you still flag it? it's perfectly related to your post, and it's not like i'm making money off the damn thing. i'm sharing it out of the goodness of my heart for pete's sake!

Mark 2010-09-29 00:44:50

@Brandon: http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack

tommieb75 2010-09-29 00:48:32

@Brandon: I think you're right actually. What you would do is use HTML agility pack to find the body tag, remove all the attributes, then re-render the HTML... which I'm not even sure is possible with htmlagilitypack.. never used it for generating html.

Mark 2010-09-29 00:49:04

tommieb75... i think you are the worst SO user I've met so far. congrats.

Brandon 2010-09-29 00:49:45

Answer 2

+1 A:

If you're doing a quick-and-dirty shell script, and you don't plan on using this much...

s/<body [^>]*>/<body>/

but I'm going to have to agree with everyone else that a parser is a better idea. I understand that sometimes you must make do with limited resources, but if you rely on a regex here... it has a strong chance of coming back to bite you when you least expect it.

and to remove a specific attribute:

s/\(<body[^>]*\) style="[^>"]*"/\1/

That will grab "body" and any attributes up to "style", drop the "style" attribute, and spit out the rest.
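For the C# side, here is a minimal sketch of the same two substitutions using `Regex` from `System.Text.RegularExpressions`. The class and method names (`BodyTagRegex`, `StripAllAttributes`, `StripStyleAttribute`) are made up for illustration. The patterns are adapted from the sed expressions rather than copied: sed's `\(...\)` and `\1` become `(...)` and `$1`, and a `\b` plus `\s+` are used so the match also works when `style` is the first attribute. A count of 1 mimics sed's replace-first-match behaviour.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BodyTagRegex
{
    // Matches an opening body tag with or without attributes.
    private static readonly Regex AnyBodyTag = new Regex(@"<body\b[^>]*>");

    // Captures everything in the body tag before the style attribute as group 1.
    private static readonly Regex BodyStyle =
        new Regex(@"(<body\b[^>]*?)\s+style=""[^""]*""");

    // Roughly: s/<body [^>]*>/<body>/
    public static string StripAllAttributes(string html)
    {
        return AnyBodyTag.Replace(html, "<body>", 1);
    }

    // Drops only the style attribute and keeps the other attributes.
    public static string StripStyleAttribute(string html)
    {
        return BodyStyle.Replace(html, "$1", 1);
    }
}
```

Both methods are case-sensitive, like the sed version; passing `RegexOptions.IgnoreCase` to the constructors would also catch `<BODY ...>`.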

Tim 2010-09-28 23:58:45

In what way could it come back to bite him if all he wants to do is remove unnestable attributes?

MooGoo 2010-09-29 00:06:58

@Moogoo - see my comment above!

tommieb75 2010-09-29 00:08:36

how to use this in C#?

Brandon 2010-09-29 00:08:44

Regardless, this is the only answer that actually bothered to *answer* the question and not just mindlessly spout "bad bad bad evil evil evil". So, +1

MooGoo 2010-09-29 00:44:10

@Moogoo totally agree. I'm surprised that there are so many HtmlAgilityPack or Anti-HTML-regex disciples. Mind you I'm not against HtmlAgilityPack... just want a more measured response.

Brandon 2010-09-29 00:48:15

It is simply a knee-jerk reaction as many people *do* want to use regex to parse nested HTML tags which as you may have heard will not work. However in the limited case you are describing, it should do just fine. I'm fairly certain that many programmers here use regex to find and replace things in their code *all the time* without ripping a single hole in the fabric of spacetime. Programming languages are not regular either, so next time you want to change the name of a variable, it damn well better be done using an abstract syntax tree!

MooGoo 2010-09-29 00:57:38

Mainly, it's pretty easy to mess up a regex. In testing, it may work for every case you try, but in production, you are very likely to encounter a new, unexpected case. This can cause problems with some regex casting too wide a net. And modifying xml or html this way can result in invalid xml or html. So, basically, there's a risk of bugs. But as long as you understand the risks, regex can still be very useful. And yes, the lack of any real answers (at the time) is why I posted this in the first place. Just because a tool may not be The Best Way (tm) doesn't mean it's not useful.
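To make the "too wide a net" risk concrete, here is a small C# sketch. The class name `RegexPitfalls`, the methods `NaiveStrip` and `Demo`, and the sample inputs are made up for illustration; the pattern is the simple `<body[^>]*>` form used elsewhere in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfalls
{
    // Naive approach: replace anything that looks like an opening body tag.
    public static string NaiveStrip(string html)
    {
        return Regex.Replace(html, @"<body[^>]*>", "<body>");
    }

    public static void Demo()
    {
        // The expected case works.
        Console.WriteLine(NaiveStrip("<body bgcolor=\"White\">"));
        // prints: <body>

        // A '>' inside a quoted attribute value ends the match early
        // and leaves the rest of the attribute behind as text.
        Console.WriteLine(NaiveStrip("<body onload=\"if (a > b) go()\">"));
        // prints: <body> b) go()">

        // Without a word boundary, a different tag that merely starts
        // with "body" is rewritten as well.
        Console.WriteLine(NaiveStrip("<bodyguard id=\"x\">"));
        // prints: <body>
    }
}
```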

Tim 2010-09-29 16:49:10

@Tim you're right.

Brandon 2010-09-30 06:24:53

Answer 3

+3 A:

You can't parse XHTML with regex. Have a look at the HTML Agility Pack instead.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
if (body != null)
{
    body.Attributes.Remove("style");
}
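Since the question asks about the body tag's attributes in general, here is a minimal sketch that extends the snippet above to drop every attribute and hand the document back as a string. It assumes the HtmlAgilityPack package is referenced; the class and method names (`BodyCleaner`, `StripBodyAttributes`) are made up for illustration.

```csharp
using HtmlAgilityPack;

public static class BodyCleaner
{
    public static string StripBodyAttributes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the fragment has no <body> element,
        // in which case the input is returned as parsed, without changes.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body != null)
        {
            // Remove every attribute, not just "style".
            body.Attributes.RemoveAll();
        }

        // OuterHtml on the document node serializes the whole modified document.
        return doc.DocumentNode.OuterHtml;
    }
}
```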

dtb 2010-09-29 00:03:35

what if the html block i'm looking at does not contain the body tag/node, will this still work? I'm only filtering a certain section of a page.

Brandon 2010-09-29 00:42:57

@Brandon: SelectSingleNode returns `null` if no body element is present.

dtb 2010-09-29 00:45:07

now.. that's a proper answer...

Brandon 2010-09-29 00:53:34

Is it possible to get the doc back as a string again, the same way it went in as a string (html)?

Brandon 2010-09-29 01:07:54

@Brandon: Try `doc.Save(filename)` or `doc.DocumentNode.OuterHtml`.

dtb 2010-09-29 01:18:57

@Brandon: see my sharpquery answer to find out how to output the document again. it uses htmlagilitypack under the hood.. just makes finding tags easier.

Mark 2010-09-29 01:24:30

also, i disagree with "don't even think about it". regexes aren't great in general for parsing html..but for stripping off a few attributes, i think a regex is fine.

Mark 2010-09-29 01:25:46

thanks, led me to doc.DocumentNode.OuterHtml; which works. I'm looking for a way to remove meta and link tags also.

Brandon 2010-09-29 01:29:39

I tried to download the documentation, but when I open the chm file the right pane shows the error 'Navigation to the webpage was canceled.'

Brandon 2010-09-29 01:34:25

yea... i haven't managed to find any documentation on htmlagilitypack either. you kind of just have to guess ;) meta and link tags can be removed in the exact same fashion, no?

Mark 2010-09-29 01:37:53

@Mark, yes I can remove them in the same fashion... the ones I'm seeing are hiding in HTML comments (<!-- ... -->), I'm seeing if I can remove those as well....

Brandon 2010-09-29 02:16:07

Started a new thread about removing comments here: http://stackoverflow.com/questions/3818404/how-to-select-node-types-which-are-htmlnodetype-comment-using-htmlagilitypack

Brandon 2010-09-29 02:48:02

Answer 4

A:

Here's how you'd do it in SharpQuery

string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
var sq = SharpQuery.Load(html);
var body = sq.Find("body").Single();
foreach (var a in body.Attributes.ToArray())
    a.Remove();
StringWriter sw = new StringWriter();
body.OwnerDocument.Save(sw);
Console.WriteLine(sw.ToString());

Which depends on HtmlAgilityPack and is a beta product... but I wanted to prove that you could do it this way.

Mark 2010-09-29 01:06:41

Answer 5

+1 A:

Three ways to do it with regexes...

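string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
// 1. Lookarounds: delete only the text between "<body" and the next ">".
string a1 = Regex.Replace(html, @"(?<=<body\b).*?(?=>)", "");
// 2. Lazily match the whole tag and rebuild it from the captured tag name.
string a2 = Regex.Replace(html, @"<(body)\b.*?>", "<$1>");
// 3. Like 2, but [^>]* also spans line breaks, which '.' does not by default.
string a3 = Regex.Replace(html, @"<(body)(\s[^>]*)?>", "<$1>");
// All three print <body> for this input.
Console.WriteLine(a1);
Console.WriteLine(a2);
Console.WriteLine(a3);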

Mark 2010-09-29 01:23:16

Answer 6

A:

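// Requires: using System; using System.Text.RegularExpressions;
// Remove every attribute from the body tag.
string pattern = @"<body[^>]*>";
string test = @"<body bgcolor=""White"" style=""font-family:sans-serif;font-size:10pt;"">";
string result = Regex.Replace(test,pattern,"<body>",RegexOptions.IgnoreCase);
Console.WriteLine("{0}",result);   // <body>
// Remove only the style attribute (.NET allows the variable-length lookbehind used here).
string pattern2 = @"(?<=<body[^>]*)\s*style=""[^""]*""(?=[^>]*>)";
result = Regex.Replace(test, pattern2, "", RegexOptions.IgnoreCase);
Console.WriteLine("{0}",result);   // <body bgcolor="White">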

This is just in case your project requirements limit your third-party options (and don't give you the time to reinvent a parser).

Les 2010-09-29 01:32:38

Answer 7

A:

Chunky code I've got working at the moment, will be looking at reducing this:

        private static string SimpleHtmlCleanup(string html)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            //foreach(HtmlNode nodebody in doc.DocumentNode.SelectNodes("//a[@href]"))

            var bodyNodes = doc.DocumentNode.SelectNodes("//body");
            if (bodyNodes != null)
            {
                foreach (HtmlNode nodeBody in bodyNodes)
                {
                    nodeBody.Attributes.Remove("style"); 
                }
            }

            var scriptNodes = doc.DocumentNode.SelectNodes("//script");
            if (scriptNodes != null)
            {
                foreach (HtmlNode nodeScript in scriptNodes)
                {
                    nodeScript.Remove();
                }
            }

            var linkNodes = doc.DocumentNode.SelectNodes("//link");
            if (linkNodes != null)
            {
                foreach (HtmlNode nodeLink in linkNodes)
                {
                    nodeLink.Remove();
                }
            }

            var xmlNodes = doc.DocumentNode.SelectNodes("//xml");
            if (xmlNodes != null)
            {
                foreach (HtmlNode nodeXml in xmlNodes)
                {
                    nodeXml.Remove();
                }
            }

            var styleNodes = doc.DocumentNode.SelectNodes("//style");
            if (styleNodes != null)
            {
                foreach (HtmlNode nodeStyle in styleNodes)
                {
                    nodeStyle.Remove();
                }
            }

            var metaNodes = doc.DocumentNode.SelectNodes("//meta");
            if (metaNodes != null)
            {
                foreach (HtmlNode nodeMeta in metaNodes)
                {
                    nodeMeta.Remove();
                }
            }

            var result = doc.DocumentNode.OuterHtml;

            return result;
        }
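One way the method above might be reduced, as a sketch only: loop over the tag names instead of repeating the same block per tag. The names `HtmlCleanup`, `SimpleHtmlCleanupCompact`, and `TagsToRemove` are made up for illustration, and HtmlAgilityPack is assumed to be referenced as in the original.

```csharp
using HtmlAgilityPack;

public static class HtmlCleanup
{
    // Elements removed outright, the same set as in the longer method.
    private static readonly string[] TagsToRemove =
        { "script", "link", "xml", "style", "meta" };

    public static string SimpleHtmlCleanupCompact(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var bodyNodes = doc.DocumentNode.SelectNodes("//body");
        if (bodyNodes != null)
        {
            foreach (HtmlNode body in bodyNodes)
            {
                body.Attributes.Remove("style");
            }
        }

        // Query and remove one tag name at a time, as the original does.
        foreach (string tag in TagsToRemove)
        {
            var nodes = doc.DocumentNode.SelectNodes("//" + tag);
            if (nodes == null) continue;

            foreach (HtmlNode node in nodes)
            {
                node.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```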

Brandon 2010-09-29 01:52:41

this is for reference.

Brandon 2010-09-29 02:16:37

code very much improved/reduced, reference here: http://stackoverflow.com/questions/3818404/how-to-select-node-types-which-are-htmlnodetype-comment-using-htmlagilitypack/3828478#3828478

Brandon 2010-09-30 07:03:36
