I want a regex that matches conditional comments in an HTML source page so I can remove only those. I want to preserve the regular comments.

I would also like to avoid using the .*? notation if possible.

The text is

foo

<!--[if IE]>

<style type="text/css">

ul.menu ul li{
    font-size: 10px;
    font-weight:normal;
    padding-top:0px;
}

</style>

<![endif]-->

bar

and I want to remove everything between <!--[if IE]> and <![endif]-->, including the markers themselves.

EDIT: It is because of BeautifulSoup I want to remove these tags. BeautifulSoup fails to parse and gives an incomplete source

EDIT2: [if IE] isn't the only condition. There are lots more and I don't have any list of all possible combinations.

EDIT3: Vinko Vrsalovic's solution works, but the actual reason BeautifulSoup failed was a rogue comment within the conditional comment, like this:

<!--[if lt IE 7.]>
<script defer type="text/javascript" src="pngfix_253168.js"></script><!--png fix for IE-->
<![endif]-->

Notice the <!--png fix for IE--> comment?

Though my problem was solved, I would love to get a regex solution for this.

A: 

Don't use a regular expression for this. You will get confused about comments containing opening tags and whatnot, and do the wrong thing. HTML isn't regular, and trying to modify it with a single regular expression will fail.

Use an HTML parser for this. BeautifulSoup is a good, easy, flexible and sturdy one that is able to handle real-world (meaning hopelessly broken) HTML. With it you can just look up all comment nodes, examine their content (you can use a regular expression for that, if you wish) and remove them if they need to be removed.

Thomas Wouters
Strictly speaking, the conditional comments are not HTML but an embedded macro language, which AFAIK cannot be nested. So a regex might work.
JacquesB
+1  A: 

@Benoit

Small correction (with single-line/dotall mode turned on, so that . also matches newlines):

 "<!--\[if IE\]>.*?<!\[endif\]-->"
Nescio
Did you read the "I would also like to avoid using the .*? notation if possible." part?
Huppie
A: 

This works in Visual Studio 2005, where there is no line span option:

\<!--\[if IE\]\>{.|\n}*\<!\[endif\]--\>

Lev
+1  A: 
>>> from BeautifulSoup import BeautifulSoup, Comment
>>> html = '<html><!--[if IE]> bloo blee<![endif]--></html>'
>>> soup = BeautifulSoup(html)
>>> comments = soup.findAll(text=lambda text: isinstance(text, Comment)
...                         and text.find('if') != -1)
>>> [comment.extract() for comment in comments]
[u'[if IE]> bloo blee<![endif]']
>>> print soup.prettify()
<html>
</html>
>>>

If your data confuses BeautifulSoup, you can fix it beforehand or customize the parser, among other solutions.

EDIT: Per your comment, you just modify the lambda passed to findAll as you need (I modified it)
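
As a rough sketch of a stricter test than text.find('if'), the predicate below only accepts comment bodies that start like a conditional comment. The helper name is_conditional and its pattern are assumptions made for illustration, not part of BeautifulSoup:

```python
import re

# Hypothetical helper: a conditional comment body such as
# "[if IE]> ... <![endif]" starts with "[if", whitespace,
# a condition, and then "]>".
CONDITIONAL_START = re.compile(r'\[if\s[^\]]*\]>')

def is_conditional(comment_text):
    """Return True if the comment body looks like a conditional comment."""
    return CONDITIONAL_START.match(comment_text.lstrip()) is not None

# In the BeautifulSoup 3 session above, the call would then read:
#   soup.findAll(text=lambda t: isinstance(t, Comment) and is_conditional(t))
print(is_conditional('[if IE]> bloo blee<![endif]'))    # True
print(is_conditional(' if you can read this, hello '))  # False
```

A regular comment that merely contains the word "if" is then left alone.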

Vinko Vrsalovic
That was helpful, but I don't want to lose all the comment tags. Only the conditional css comments.
cnu
+1  A: 

Here's what you'll need:

<!(|--)\[[^\]]+\]>.+?<!\[endif\](|--)>

It will filter out all sorts of conditional comments including:

<!--[if anything]>
    ...
<![endif]-->

and

<![if ! IE 6]>
    ...
<![endif]>
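
A minimal Python sketch of applying the pattern above. The function name and the sample text are made up for illustration, and re.DOTALL is assumed so that . also matches the newlines inside the block:

```python
import re

# The pattern from this answer; DOTALL lets "." match newlines too.
CONDITIONAL = re.compile(r'<!(|--)\[[^\]]+\]>.+?<!\[endif\](|--)>', re.DOTALL)

def strip_conditional_comments(html):
    """Remove every conditional-comment block, markers included."""
    return CONDITIONAL.sub('', html)

sample = (
    'foo\n'
    '<!--[if IE]>\n<style type="text/css">ul.menu ul li{font-size: 10px;}</style>\n<![endif]-->\n'
    '<!-- a regular comment -->\n'
    '<![if ! IE 6]>\n<p>downlevel-revealed</p>\n<![endif]>\n'
    'bar'
)

# Only the two conditional blocks disappear; the regular comment stays.
print(strip_conditional_comments(sample))
```

Note that removing the second (downlevel-revealed) form also removes markup that other browsers would have rendered.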


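EDIT3: Vinko Vrsalovic's solution works, but the actual reason BeautifulSoup failed was a rogue comment within the conditional comment, like this:

<!--[if lt IE 7.]>
<script defer type="text/javascript" src="pngfix_253168.js"></script><!--png fix for IE-->
<![endif]-->

Notice the <!--png fix for IE--> comment?

Though my problem was solved, I would love to get a regex solution for this.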

How about this:

(<!(|--)\[[^\]]+\]>.*?)(<!--.+?-->)(.*?<!\[endif\](|--)>)

Do a replace on that regular expression leaving \1\4 (or $1$4) as the replacement.
I know it has .*? and .+? in it; see my comment on this post.
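
In Python the replacement could be sketched as follows. The function name is hypothetical, and re.DOTALL is assumed so the lazy groups can cross line breaks:

```python
import re

# Group 1: opening marker plus everything before the inner comment.
# Group 3: the inner (rogue) comment.
# Group 4: everything after it, up to and including the endif marker.
NESTED = re.compile(
    r'(<!(|--)\[[^\]]+\]>.*?)(<!--.+?-->)(.*?<!\[endif\](|--)>)',
    re.DOTALL,
)

def drop_inner_comment(html):
    """Keep groups 1 and 4, dropping the inner comment (group 3)."""
    return NESTED.sub(r'\1\4', html)

sample = (
    '<!--[if lt IE 7.]>\n'
    '<script defer type="text/javascript" src="pngfix_253168.js"></script><!--png fix for IE-->\n'
    '<![endif]-->'
)
print(drop_inner_comment(sample))
```

This removes at most one inner comment per match. If a conditional comment contains no inner comment, the lazy groups can reach past its <![endif]--> to a later regular comment, so it is safest to run this only on input known to have the problem.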

Huppie
Sadly I was not able to avoid .+? syntax though...
Huppie
You can avoid the .+? syntax by doing a forward-reference but I don't have my regex book with me for the exact syntax :P
Huppie
A: 

I'd simply go with :

import re

html = """fjlk<wb>dsqfjqdsmlkf fdsijfmldsqjfl fjdslmfkqsjf<---- fdjslmjkqfs---><!--[if lt IE 7.]>\
<script defer type="text/javascript" src="pngfix_253168.js"></script><!--png fix for IE-->\
<![endif]-->fjlk<wb>dsqfjqdsmlkf fdsijfmldsqjfl fjdslmfkqsjf<---- fdjslmjkqfs--->"""

# here the black magic occurs (without '.')
clean_html = ''.join(re.split(r'<!--\[[^ø]+?endif]-->', html))

print(clean_html)

fjlk<wb>dsqfjqdsmlkf fdsijfmldsqjfl fjdslmfkqsjf<---- fdjslmjkqfs--->fjlk<wb>dsqfjqdsmlkf fdsijfmldsqjfl fjdslmfkqsjf<---- fdjslmjkqfs--->

N.B.: [^ø] matches any character that is not 'ø', newlines included, so neither '.' nor a DOTALL flag is needed. It is fast, and 'ø' is hard to type by mistake, but the trick only works if 'ø' never occurs between the opening and closing markers. Since 'ø' is an ordinary letter in Danish and Norwegian, that is not guaranteed for every page.

If you don't feel like using ø, however, you can use chr(7) to generate the "system bell" character, which is unprintable and very unlikely to appear in a web page ;-)
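
A different way to get the same effect without picking a "never used" character is the class [\s\S], which matches any character at all. This is a swapped-in variant rather than the pattern above, and the sample string is made up for illustration:

```python
import re

html = (
    'before<!-- keep me -->'
    '<!--[if lt IE 7.]>\n'
    '<script defer type="text/javascript" src="pngfix_253168.js"></script><!--png fix for IE-->\n'
    '<![endif]-->after'
)

# [\s\S] is "whitespace or non-whitespace", i.e. any character,
# newlines included, so no sentinel character has to be chosen.
pattern = re.compile(r'<!--\[[\s\S]+?endif\]-->')
clean_html = pattern.sub('', html)

print(clean_html)  # before<!-- keep me -->after
```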

e-satis
+1  A: 

As I see it, you only need to worry about downlevel-hidden comments (the ones that start with <!--), and you don't need to match anything beyond the word if and the space following it. This should do what you want:

"<!--\[if\s(?:[^<]+|<(?!!\[endif\]-->))*<!\[endif\]-->"

That mess in the middle is to satisfy your desire not to use .*?, but I don't really think it's worth the effort. The .*? approach should work fine if you compile the regex with the re.S flag set or wrap it in (?s:...). For example:

"(?s:<!--\[if\s.*?<!\[endif\]-->)"
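
As a quick check, here is a small Python sketch that runs both patterns on the asker's problem case. The variable names and the sample text are assumptions, and the scoped (?s:...) form needs Python 3.6 or later:

```python
import re

# The explicit version: no ".", so no DOTALL flag is required.
EXPLICIT = re.compile(r'<!--\[if\s(?:[^<]+|<(?!!\[endif\]-->))*<!\[endif\]-->')
# The lazy version, with DOTALL scoped to the group via (?s:...).
LAZY = re.compile(r'(?s:<!--\[if\s.*?<!\[endif\]-->)')

sample = (
    'foo\n'
    '<!--[if lt IE 7.]>\n'
    '<script defer type="text/javascript" src="pngfix_253168.js"></script><!--png fix for IE-->\n'
    '<![endif]-->\n'
    'bar<!-- regular -->'
)

explicit_result = EXPLICIT.sub('', sample)
lazy_result = LAZY.sub('', sample)

# Both drop the whole conditional block, inner comment included,
# and leave the regular comment in place.
print(explicit_result == lazy_result)  # True
print(lazy_result)
```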
Alan Moore