ansaurus

Question

How can I remove an entire HTML tag (and its contents) by its class using a regex?

Answer 1

A:

This partly depends on the exact regex engine and language you are using. One possibility is that you need to escape the quotes and/or the forward slash. You might also want to make it case-insensitive.

<div class=\"footer\".*?>(.*?)<\/div>

Otherwise, please say what language/platform you are using: .NET, Java, Perl, ...

Hamish Downer 2008-10-22 16:31:06

Note that you need the /s option here since some of those characters may be newlines.

brian d foy 2008-10-23 01:10:47

Answer 2

A:

Why not <div class="footer".*?</div>? I'm not a regex guru either, but I don't think you need to specify that last bracket for your opening div tag.

Nick 2008-10-22 16:31:47

Perhaps he wants to capture the content of the div?

Chris Marasti-Georg 2008-10-22 16:33:11

Yes, he says he wants to remove the tags, not the content.

Hamish Downer 2008-10-22 16:34:46

That regex will capture everything between the first <div class="footer" and the last </div> of the entire webpage (unless the Perl function isn't matching across multiple lines).

Will 2008-10-22 16:37:09

Answer 3

+6 A:

You will also want to allow for other attributes before class in the div tag:

<div[^>]*class="footer"[^>]*>(.*?)</div>

Also, go case-insensitive. You may need to escape things like the quotes, or the slash in the closing tag. What context are you doing this in?

Also note that HTML parsing with regular expressions can be very nasty, depending on the input. A good point is brought up in an answer below - suppose you have a structure like:

<div>
    <div class="footer">
        <div>Hi!</div>
    </div>
</div>

Trying to build a regex for that is a recipe for disaster. Your best bet is to load the document into a DOM, and perform manipulations on that.

Pseudocode that should map closely to XML::DOM:

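document = // load document
divs = document.getElementsByTagName("div");
for(div in divs) {
    if(div.getAttribute("class") == "footer") {
        parent = div.getParentNode();
        // move each child up in front of the div, so the content is kept
        for(child in div.getChildNodes()) {
            // filter attribute types?
            parent.insertBefore(child, div);   // (newChild, refChild)
        }
        // drop the now-empty div; to discard the contents too, skip the loop above
        parent.removeChild(div);
    }
}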

Perl libraries for this include HTML::DOM and XML::DOM. .NET has built-in libraries to handle DOM parsing.

Chris Marasti-Georg 2008-10-22 16:32:54

It works when all the HTML is on the same line, but not when it's indented. Why the [^>] in "footer"[^>]?

Daok 2008-10-22 16:39:33

To make the regexp deterministic. Most engines will handle indeterminacy without a problem, but it can sometimes yield unexpected results. Technically, there's still a non-deterministic issue between [^>] and [c], but it's less significant.

Daniel Spiewak 2008-10-22 16:42:54

Looking for anything that's not the closing bracket

Chris Marasti-Georg 2008-10-22 16:43:07
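
A minimal sketch in Python of the pattern from this answer, assuming Python's re module stands in for whatever engine the asker is using (the thread never settles on one). It shows the two points made above: [^>]* lets other attributes appear before and after class without running past the end of the tag, and the flags take care of case and of newlines inside the div. The sample HTML is made up for illustration.

```python
import re

# Pattern from the answer above. IGNORECASE handles <DIV>, and DOTALL lets
# "." match the newlines of an indented document.
FOOTER_DIV = re.compile(
    r'<div[^>]*class="footer"[^>]*>(.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)

# Made-up input: an extra attribute before class, upper-case tags, newlines.
html = '<p>Body</p>\n<DIV id="f" class="footer">\n  Contact us\n</DIV>\n<p>End</p>'

# Replace the whole match (tag and contents) with nothing.
cleaned = FOOTER_DIV.sub('', html)
print(repr(cleaned))
```

This only behaves when the footer div contains no nested div, which is the limitation the rest of the thread discusses.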

Answer 4

A:

Try this:

<([^\s>]+).*?class="footer".*?>([\s\S]*?)</([^\s>]+)>

Your biggest problem is going to be nested tags. For example:

<div class="footer"><b></b></div>

The regexp given would match everything through the </b>, leaving the </div> dangling on the end. You will have to either assume that the tag you're looking for has no nested elements, or you will need to use some sort of parser from HTML to DOM and an XPath query to remove an entire sub-tree.

Daniel Spiewak 2008-10-22 16:34:19

You could use a back-reference on the first captured group at the end of the regex...

Chris Marasti-Georg 2008-10-22 16:38:11
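
A rough sketch of the back-reference idea in Python (an assumed language; the pattern is a hypothetical variant, not the one from the answer). Group 1 captures the tag name and \1 forces the closing tag to use the same name, so a nested tag of a different type no longer ends the match early. Nesting of the same tag still defeats it.

```python
import re

# Hypothetical variant: \1 must repeat whatever tag name group 1 captured.
pattern = re.compile(r'<(\w+)[^>]*class="footer"[^>]*>(.*?)</\1>', re.DOTALL)

# The example from the answer: a different tag nested inside the div.
html = '<div class="footer"><b></b></div>'
match = pattern.search(html)
print(match.group(1), match.group(2))

# Same tag nested inside itself: the match stops at the first </div>,
# so one closing tag is left behind.
nested = '<div class="footer"><div>Hi!</div></div>'
leftover = pattern.sub('', nested)
print(leftover)
```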

The regex given would not match the middle tags. The lazy quantifier inside the div tag will stop matching at the > at the end of the div. And so the bold tags will be matched by the (.*?) as I think is wanted.

Hamish Downer 2008-10-22 16:39:32

Hmm, well it's either going to be too lazy or too greedy. Another answer gives an example of greedily matching one too *many* closing tags. Regular expressions just aren't powerful enough for this sort of thing.

Daniel Spiewak 2008-10-22 16:41:26

Answer 5

+1 A:

In Perl you need the /s modifier, otherwise the dot won't match a newline.

That said, using a proper HTML or XML parser to remove unwanted parts of an HTML file is much more appropriate.

moritz 2008-10-22 16:37:05
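
To illustrate the point about the dot and newlines, here is a small sketch in Python, where re.DOTALL plays the role of Perl's /s modifier. The input string is invented for the example.

```python
import re

# Invented input: the div's contents span several lines.
html = '<div class="footer">\n  line one\n  line two\n</div>'
pattern = r'<div class="footer".*?>(.*?)</div>'

# Without DOTALL, "." cannot cross a newline, so the pattern never
# reaches the closing tag.
without_dotall = re.search(pattern, html)

# With DOTALL (the counterpart of /s), "." also matches newlines.
with_dotall = re.search(pattern, html, re.DOTALL)

print(without_dotall)
print(repr(with_dotall.group(1)))
```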

Answer 6

A:

This will be tricky because of the greediness of regular expressions. (Note that my examples may be specific to Perl, but I know that greediness is a general issue with REs.) If the second quantifier is greedy (.* rather than .*?), it will match as much as possible before the last </div>, so if you have the following:

<div class="SomethingElse"><div class="footer"> stuff </div></div>

The expression will match:

<div class="footer"> stuff </div></div>

which is not likely what you want.

Graeme Perrow 2008-10-22 16:37:26
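
The difference between greedy and lazy quantifiers on this input can be checked with a short Python sketch (Python is an assumption here; the answer's examples are Perl-oriented, but the quantifiers behave the same way).

```python
import re

html = '<div class="SomethingElse"><div class="footer"> stuff </div></div>'

# Greedy: .* grabs as much as it can, so the match runs to the last </div>.
greedy = re.search(r'<div class="footer".*>(.*)</div>', html)

# Lazy: .*? grabs as little as it can, so the match stops at the first </div>.
lazy = re.search(r'<div class="footer".*?>(.*?)</div>', html)

print(greedy.group(0))
print(lazy.group(0))
```

The lazy form is right for this input, but it would stop too early if the footer itself contained a nested div.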

Answer 7

+13 A:

As other people said, HTML is notoriously tricky to deal with using regexes, and a DOM approach might be better. E.g.:

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( 'yourdocument.html' );

for my $node ( $tree->findnodes( '//*[@class="footer"]' ) ) {
    $node->replace_with_content;   # delete element, but not the children
}

print $tree->as_HTML;

Yanick 2008-10-22 16:52:25

And to delete the element and its children, replace 'replace_with_content' with 'detach'.

Yanick 2008-10-22 20:08:12

+1 for using XPath, which is full of win. :-D

Chris Jester-Young 2009-09-10 02:54:51
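
For readers not working in Perl, the same parser-based idea can be sketched with Python's standard html.parser module. This is only an illustration under stated assumptions: the class name ClassStripper and the helper strip_class are hypothetical, and the depth counting assumes every non-void tag inside the removed element is properly closed. It removes the element together with its children, like 'detach' above, and copes with the nested divs from Answer 3.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassStripper(HTMLParser):
    """Re-emit HTML, skipping every element whose class list has `target`."""

    def __init__(self, target):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.target = target
        self.out = []
        self.skip_depth = 0  # > 0 while inside an element being removed

    def _has_target(self, attrs):
        return self.target in (dict(attrs).get("class") or "").split()

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1      # nested element inside the removed one
        elif self._has_target(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1       # start skipping at this element
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the nesting depth.
        if not self.skip_depth and not self._has_target(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        if not self.skip_depth:
            self.out.append(f"<!{decl}>")


def strip_class(html, target):
    """Return `html` without the elements carrying class `target`."""
    parser = ClassStripper(target)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# The nested structure from Answer 3.
sample = '<div>\n    <div class="footer">\n        <div>Hi!</div>\n    </div>\n</div>'
print(repr(strip_class(sample, "footer")))
```

Counting depth is what a regex cannot do here: the inner </div> lowers the counter instead of ending the removal.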

Answer 8

A:

<div[^>]*class="footer"[^>]*>(.*?)</div>

This worked for me, but I needed to put backslashes before the special characters:

<div[^>]*class=\"footer\"[^>]*>(.*?)<\/div>

2009-02-05 04:07:42
