Does anyone have a regex that can remove the attributes from a body tag?

for example:

<body bgcolor="White" style="font-family:sans-serif;font-size:10pt;">

to return:

<body>

It would also be interesting to see an example of removing just a specific attribute, like:

<body bgcolor="White" style="font-family:sans-serif;font-size:10pt;">

to return:

<body bgcolor="White">
A: 

LittleBobbyTables' comment above is correct!

Regex is not the right tool. If you read the answer that LittleBobbyTables linked to, you will see it is actually true: using regex for this kind of thing will strike you down with undue strain and stress, as that answer clearly shows from what its author went through by using the wrong tool for the job.

Regex is NOT the duct tape for doing such things, nor is it the answer to everything, 42 included... use the right tool for the right job.

Instead, you should check out HtmlAgilityPack, which will do the job for you and ultimately save you from the stress, tears and blood that come from getting to grips with parsing HTML using regex...

tommieb75
Can you give an example of HtmlAgilityPack accomplishing what I want?
Brandon
tommieb75
I've already read it, thank you very much... and I don't find it helpful for my scenario.
Brandon
I've also read http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack but don't see how to modify this to suit my requirements...
Brandon
Just because you know a lot doesn't mean you have to be so presumptuous about others 'not bothering'.
Brandon
@tommieb75: are you serious? if i said it *wasn't* a self promo, would you still flag it? it's perfectly related to your post, and it's not like i'm making money off the damn thing. i'm sharing it out of the goodness of my heart for pete's sake!
Mark
@Brandon: http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack
tommieb75
@Brandon: I think you're right actually. What you would do is use HTML agility pack to find the body tag, remove all the attributes, then re-render the HTML... which I'm not even sure is possible with htmlagilitypack.. never used it for generating html.
Mark
tommieb75... I think you are the worst SO user I've met so far. Congrats.
Brandon
+1  A: 

If you're doing a quick-and-dirty shell script, and you don't plan on using this much...

s/<body [^>]*>/<body>/

but I'm going to have to agree with everyone else that a parser is a better idea. I understand that sometimes you must make do with limited resources, but if you rely on a regex here... it has a strong chance of coming back to bite you when you least expect it.

and to remove a specific attribute:

s/\(<body [^>]*\) style="[^>"]*"/\1/

That will grab "body" and any attributes up to "style", drop the "style" attribute, and spit out the rest.
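
For C#, the two substitutions above could be sketched roughly as follows. This is an adaptation rather than a literal translation: the class and method names are hypothetical, and the patterns use `\b` and `\s+` where the sed versions use a literal space, so that `style` is also caught when it is the first attribute. It still assumes double-quoted attribute values with no `>` inside them.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative, not from the answer.
public static class BodyTagSketch
{
    // Adapted from s/<body [^>]*>/<body>/ : drop every attribute.
    public static string StripAllAttributes(string html) =>
        Regex.Replace(html, @"<body\b[^>]*>", "<body>", RegexOptions.IgnoreCase);

    // Adapted from s/\(<body [^>]*\) style="[^>"]*"/\1/ : drop only style="...".
    // Group 1 keeps "<body" plus any attributes that come before style.
    public static string StripStyleAttribute(string html) =>
        Regex.Replace(html, @"(<body\b[^>]*?)\s+style=""[^"">]*""", "$1",
                      RegexOptions.IgnoreCase);
}
```

Calling `BodyTagSketch.StripAllAttributes` on the example tag from the question gives `<body>`, and `BodyTagSketch.StripStyleAttribute` gives `<body bgcolor="White">`.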

Tim
In what way could it come back to bite him if all he wants to do is remove unnestable attributes?
MooGoo
@Moogoo - see my comment above!
tommieb75
how to use this in C#?
Brandon
Regardless, this is the only answer that actually bothered to *answer* the question and not just mindlessly spout "bad bad bad evil evil evil". So, +1
MooGoo
@Moogoo totally agree. I'm surprised that there are so many HtmlAgilityPack or Anti-HTML-regex disciples. Mind you I'm not against HtmlAgilityPack... just want a more measured response.
Brandon
It is simply a knee-jerk reaction, as many people *do* want to use regex to parse nested HTML tags, which as you may have heard will not work. However, in the limited case you are describing, it should do just fine. I'm fairly certain that many programmers here use regex to find and replace things in their code *all the time* without ripping a single hole in the fabric of spacetime. Programming languages are not regular either, so next time you want to change the name of a variable, it damn well better be done using an abstract syntax tree!
MooGoo
Mainly, it's pretty easy to mess up a regex. In testing, it may work for every case you try, but in production, you are very likely to encounter a new, unexpected case. This can cause problems with some regex casting too wide a net. And modifying xml or html this way can result in invalid xml or html. So, basically, there's a risk of bugs. But as long as you understand the risks, regex can still be very useful. And yes, the lack of any real answers (at the time) is why I posted this in the first place. Just because a tool may not be The Best Way (tm) doesn't mean it's not useful.
Tim
@Tim you're right.
Brandon
+3  A: 

You can't parse XHTML with regex. Have a look at the HTML Agility Pack instead.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
if (body != null)
{
    body.Attributes.Remove("style");
}
dtb
what if the html block i'm looking at does not contain the body tag/node, will this still work? I'm only filtering a certain section of a page.
Brandon
@Brandon: SelectSingleNode returns `null` if no body element is present.
dtb
now.. that's a proper answer...
Brandon
Is it possible to get the doc back as a string again, the same way it goes in as a string (html)?
Brandon
@Brandon: Try `doc.Save(filename)` or `doc.DocumentNode.OuterXml`.
dtb
@Brandon: see my sharpquery answer to find out how to output the document again. it uses htmlagilitypack under the hood.. just makes finding tags easier.
Mark
also, i disagree with "don't even think about it". regexes aren't great in general for parsing html..but for stripping off a few attributes, i think a regex is fine.
Mark
thanks, led me to doc.DocumentNode.OuterHtml; which works. I'm looking for a way to remove meta and link tags also.
Brandon
I tried to download the documentation, but when I open the chm file the right panel shows the error 'navigation to webpage was canceled.'
Brandon
yea... i haven't managed to find any documentation on htmlagilitypack either. you kind of just have to guess ;) meta and link tags can be removed in the exact same fashion, no?
Mark
@Mark, yes I can remove them in the same fashion... the ones I'm seeing are hiding in <!--[if gte mso 9]><![endif]-->, I'm seeing if I can remove that as well....
Brandon
Started a new thread about removing comments here: http://stackoverflow.com/questions/3818404/how-to-select-node-types-which-are-htmlnodetype-comment-using-htmlagilitypack
Brandon
A: 

Here's how you'd do it in SharpQuery

string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
var sq = SharpQuery.Load(html);
var body = sq.Find("body").Single();
foreach (var a in body.Attributes.ToArray())
    a.Remove();
StringWriter sw = new StringWriter();
body.OwnerDocument.Save(sw);
Console.WriteLine(sw.ToString());

Which depends on HtmlAgilityPack and is a beta product... but I wanted to prove that you could do it this way.

Mark
+1  A: 

Three ways to do it with regexes...

string html = "<body bgcolor=\"White\" style=\"font-family:sans-serif;font-size:10pt;\">";
string a1 = Regex.Replace(html, @"(?<=<body\b).*?(?=>)", "");
string a2 = Regex.Replace(html, @"<(body)\b.*?>", "<$1>");
string a3 = Regex.Replace(html, @"<(body)(\s[^>]*)?>", "<$1>");
Console.WriteLine(a1);
Console.WriteLine(a2);
Console.WriteLine(a3);
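
The three patterns agree on the single-line example, but they are not interchangeable. As a hedged illustration (the wrapper class and method names are hypothetical, not part of the answer), the sketch below wraps each pattern so they can be compared on a tag whose attributes are spread over several lines. By default `.` does not match a newline in .NET, so the first two patterns are expected to leave such a tag untouched, while the `[^>]` class in the third matches newlines as well.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class around the three patterns from the answer.
public static class BodyPatternComparison
{
    // a1: lookbehind/lookahead; '.' does not match '\n' by default.
    public static string A1(string html) =>
        Regex.Replace(html, @"(?<=<body\b).*?(?=>)", "");

    // a2: lazy '.*?' up to the first '>'; '.' also stops at '\n' by default.
    public static string A2(string html) =>
        Regex.Replace(html, @"<(body)\b.*?>", "<$1>");

    // a3: '[^>]' matches any character except '>', newlines included.
    public static string A3(string html) =>
        Regex.Replace(html, @"<(body)(\s[^>]*)?>", "<$1>");
}
```

For the input `"<body\n  bgcolor=\"White\">"`, `A1` and `A2` return the string unchanged and `A3` returns `<body>`. Passing `RegexOptions.Singleline` to the first two would let `.` match newlines too.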
Mark
A: 
using System;
using System.Text.RegularExpressions;

// Remove every attribute from the body tag.
string pattern = @"<body[^>]*>";
string test = @"<body bgcolor=""White"" style=""font-family:sans-serif;font-size:10pt;"">";
string result = Regex.Replace(test, pattern, "<body>", RegexOptions.IgnoreCase);
Console.WriteLine("{0}", result);

// Remove only the style attribute (.NET allows the variable-length lookbehind used here).
string pattern2 = @"(?<=<body[^>]*)\s*style=""[^""]*""(?=[^>]*>)";
result = Regex.Replace(test, pattern2, "", RegexOptions.IgnoreCase);
Console.WriteLine("{0}", result);

This is just in case your project requirements limit your third-party options (and don't give you the time to reinvent a parser).

Les
A: 

Chunky code I've got working at the moment, will be looking at reducing this:

private static string SimpleHtmlCleanup(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //foreach(HtmlNode nodebody in doc.DocumentNode.SelectNodes("//a[@href]"))

    var bodyNodes = doc.DocumentNode.SelectNodes("//body");
    if (bodyNodes != null)
    {
        foreach (HtmlNode nodeBody in bodyNodes)
        {
            nodeBody.Attributes.Remove("style");
        }
    }

    var scriptNodes = doc.DocumentNode.SelectNodes("//script");
    if (scriptNodes != null)
    {
        foreach (HtmlNode nodeScript in scriptNodes)
        {
            nodeScript.Remove();
        }
    }

    var linkNodes = doc.DocumentNode.SelectNodes("//link");
    if (linkNodes != null)
    {
        foreach (HtmlNode nodeLink in linkNodes)
        {
            nodeLink.Remove();
        }
    }

    var xmlNodes = doc.DocumentNode.SelectNodes("//xml");
    if (xmlNodes != null)
    {
        foreach (HtmlNode nodeXml in xmlNodes)
        {
            nodeXml.Remove();
        }
    }

    var styleNodes = doc.DocumentNode.SelectNodes("//style");
    if (styleNodes != null)
    {
        foreach (HtmlNode nodeStyle in styleNodes)
        {
            nodeStyle.Remove();
        }
    }

    var metaNodes = doc.DocumentNode.SelectNodes("//meta");
    if (metaNodes != null)
    {
        foreach (HtmlNode nodeMeta in metaNodes)
        {
            nodeMeta.Remove();
        }
    }

    var result = doc.DocumentNode.OuterHtml;

    return result;
}
Brandon
this is for reference.
Brandon
code very much improved/reduced, reference here: http://stackoverflow.com/questions/3818404/how-to-select-node-types-which-are-htmlnodetype-comment-using-htmlagilitypack/3828478#3828478
Brandon