I'm currently using PHP and a regular expression to strip out all HTML comments from a page. The script works well... a little too well. It strips out all comments, including my conditional comments in the <head>. Here's what I've got:

<?php
  function callback($buffer)
  {
        return preg_replace('/<!--(.|\s)*?-->/', '', $buffer);
  }

  ob_start("callback");
?>
... HTML source goes here ...
<?php ob_end_flush(); ?>

Since my regex isn't too hot, I'm having trouble figuring out how to modify the pattern to exclude conditional comments such as:

<!--[if !IE]><!-->
<link rel="stylesheet" href="/css/screen.css" type="text/css" media="screen" />
<!-- <![endif]-->

<!--[if IE 7]>
<link rel="stylesheet" href="/css/ie7.css" type="text/css" media="screen" />
<![endif]-->

<!--[if IE 6]>
<link rel="stylesheet" href="/css/ie6.css" type="text/css" media="screen" />
<![endif]-->

Cheers

A: 

Something like this might work:

/<!--[^\[](.|\s)*?-->/

It's the same as yours, except that it ignores comments that have an opening bracket immediately following the comment start tag.

Boden
Hi Boden. This method removes the comment but leaves the <> which means stylesheets aren't applied and the document is littered with arrows.
Ian
Are you calling it like this? (not sure if this code will post in a comment) preg_replace('/<!--[^\[](.|\s)*?-->/', '', $buffer);
Boden
I'll modify the answer to include the start and end characters...
Boden
Yes, the entire line is: return preg_replace('/<!--[^\[](.|\s)*?-->/', '', $buffer); This no longer leaves the brackets, but it doesn't keep the conditional comments intact either: <!--[if !IE]> <!--[if IE 7]> <link rel="stylesheet" href="/templates/css/ie7.css" type="text/css" media="screen" /> <![endif]> <!--[if IE 6]> <link rel="stylesheet" href="/templates/css/ie6.css" type="text/css" media="screen" /> <![endif]--> <link rel="stylesheet" type="text/css" media="print" href="/templates/css/print.css" />
Ian
Hmm... I'm running it here: http://www.solmetra.com/scripts/regex/index.php using the preg_replace option and I cut and pasted your code snippet: it leaves the conditionals in there.
Boden
Sounds strange indeed. I've just copied and pasted your code and tried it again, but the same thing is still happening. My conditional comments are exactly as they are in my original question, but they still end up as shown in my comment above.
Ian
I've noticed that only the first conditional comment seems to cause the problem. It is written slightly differently from the others, with the addition of <!--> on the first line, so that its content is served to browsers other than IE (see: http://simplebits.com/notebook/2009/02/13/iegone.html). The pattern just seems to need a slight tweak to cater for this, and then it will be a perfect fit.
Ian
A: 

I'm not sure if PHP's regex engine will like the following, but try this pattern:

'/<!--(.|\s)*(\[if .*\]){0}(.|\s)*?-->/'
hythlodayr
Replacing my regex with this prompts a download/save pop-up for the index.php page rather than rendering it.
Ian
+1  A: 

If you can't get it to work with one regular expression or you find you want to preserve more comments you could use preg_replace_callback. You can then define a function to handle the comments individually.

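<?php
// Output-buffer callback: hand every comment to comment_replace_func.
// U = ungreedy, s = let "." match newlines so multi-line comments are found.
function callback($buffer) {
    return preg_replace_callback('/<!--.*-->/Us', 'comment_replace_func', $buffer);
}

// Keep conditional comments (and the "<!-- <![endif]-->" closer), drop the rest.
function comment_replace_func($m) {
    if (preg_match('/^<!--\s*(?:\[if\b|<!\[endif\])/i', $m[0])) {
        return $m[0];
    }

    return '';
}

ob_start("callback");
?>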

... HTML source goes here ...

<?php ob_end_flush(); ?>
Tom Haigh
Am I right in thinking the script should be inserted into the head like this: <?php $result = preg_replace_callback('/<!--.*-->/U', 'comment_replace_func', $buffer); function comment_replace_func($m) { if (preg_match( '/^\<\!--\[if \!/i', $m[0])) { return $m[0]; } return ''; } ob_start("callback"); ?> ... HTML source goes here ... <?php ob_end_flush(); ?> If so, this doesn't remove any comments or seem to have any effect.
Ian
+5  A: 

Since comments cannot be nested in HTML, a regex can do the job, in theory. Still, using some kind of parser would be the better choice, especially if your input is not guaranteed to be well-formed.

Here is my attempt at it. To match only normal comments, this would work. It has become quite a monster, sorry for that. I have tested it quite extensively, it seems to do it well, but I give no warranty.

<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->

Explanation:

<!--                #01: "<!--"
(?!                 #02: look-ahead: a position not followed by:
  \s*               #03:   any amount of whitespace
  (?:               #04:   non-capturing group, any of:
    \[if [^\]]+]    #05:     "[if ...]"
    |<!             #06:     or "<!"
    |>              #07:     or ">"
  )                 #08:   end non-capturing group
)                   #09: end look-ahead
(?:                 #10: non-capturing group:
  (?!-->)           #11:   a position not followed by "-->"
  .                 #12:   eat the following char, it's part of the comment
)*                  #13: end non-capturing group, repeat
-->                 #14: "-->"

Steps #02 and #11 are crucial. #02 makes sure that the following characters do not indicate a conditional comment. After that, #11 makes sure that the following characters do not indicate the end of the comment, while #12 and #13 cause the actual matching.

Apply with "global" and "dotall" flags.
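
For PHP specifically, a minimal sketch of applying this pattern might look like the following. preg_replace() replaces every match by default, so there is no separate "global" modifier, and "dotall" corresponds to the s modifier; the pattern also needs delimiters (here "/"). The function name strip_plain_comments is made up for illustration.

```php
<?php
// Hypothetical helper: removes ordinary HTML comments, keeps conditional ones.
function strip_plain_comments(string $html): string
{
    // "/" delimiters plus the "s" (dotall) modifier; the pattern itself
    // is the one explained above, unchanged.
    $pattern = '/<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->/s';

    $result = preg_replace($pattern, '', $html);

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return $result === null ? $html : $result;
}
```

Assuming the output-buffering setup from the question, the function name could be passed to ob_start() in place of "callback".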

To do the opposite (match only conditional comments), it would be something like this:

<!(--)?(?=\[)(?:(?!<!\[endif\]\1>).)*<!\[endif\]\1>

Explanation:

<!                  #01: "<!"
(--)?               #02: two dashes, optional
(?=\[)              #03: a position followed by "["
(?:                 #04: non-capturing group:
  (?!               #05:   a position not followed by
    <!\[endif\]\1>  #06:     "<![endif]>" or "<![endif]-->" (depends on #02)
  )                 #07:   end of look-ahead
  .                 #08:   eat the following char, it's part of the comment
)*                  #09: end of non-capturing group, repeat
<!\[endif\]\1>      #10: "<![endif]>" or "<![endif]-->" (depends on #02)

Again, apply with "global" and "dotall" flags.

Step "#02" is because of the "downlevel-revealed" syntax, see: "MSDN - About Conditional Comments".

Not entirely sure where spaces are allowed or expected. Add "\s*" to the expression wherever it is appropriate.
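
As a rough PHP sketch, the conditional-comment pattern could be used with preg_match_all() to list the conditional blocks in a page. One adjustment is an assumption on my part rather than something from the pattern above: PCRE, which PHP uses, treats a backreference to a group that did not take part in the match as a failure, so the optional dashes are written as ((?:--)?) here. That way group 1 is always set (possibly to an empty string) and the "<![endif]>" form can still match. The function name find_conditional_comments is hypothetical.

```php
<?php
// Hypothetical helper: returns every conditional comment found in $html.
function find_conditional_comments(string $html): array
{
    // ((?:--)?) instead of (--)? so that \1 is always defined under PCRE.
    // "s" (dotall) lets "." cross line breaks.
    $pattern = '/<!((?:--)?)(?=\[)(?:(?!<!\[endif\]\1>).)*<!\[endif\]\1>/s';

    preg_match_all($pattern, $html, $matches);

    // $matches[0] holds the full text of each match.
    return $matches[0] ?? [];
}
```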

Tomalak
Hi Tomalak, thanks for your input and the detailed explanations. Makes regex much easier :). However, I've just tried your solution and it doesn't display anything at all except a blank page. The full line I'm using is: return preg_replace('<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->', '', $buffer); Is this correct?
Ian
No, it isn't. You should read the docs on preg_replace. :-)
Tomalak
I've got to admit I've not come across preg_replace before, so I'll definitely give the docs a read as soon as I get the chance. For the purpose of this particular problem, however, is it possible for you to elaborate a little on how to implement it? Although it looks more extensive than my regex, it sounds like an interesting approach which I'd like to try.
Ian
You have a 'chance' to read the docs right now: http://php.net/manual/en/function.preg-replace.php :)) (also: http://www.php.net/manual/en/pcre.pattern.php )
Tomáš Kafka
A: 

In summary this seems to be the best solution:

<?php
  function callback($buffer) {
    return preg_replace('/<!--[^\[](.|\s)*?-->/', '', $buffer);
  }
  ob_start("callback");
?>
... HTML source goes here ...
<?php ob_end_flush(); ?>

It strips out all comments and leaves conditionals with the exception of the top one:

<!--[if !IE]><!-->
<link rel="stylesheet" href="/css/screen.css" type="text/css" media="screen" />
<!-- <![endif]-->

where the additional <!--> seems to be causing the problem.
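
A small sketch to check this: it runs the same pattern against only that first conditional block (with the link tag shortened) and captures what gets matched. The variable names are just for illustration.

```php
<?php
// The pattern from the summary above.
$pattern = '/<!--[^\[](.|\s)*?-->/';

// The first conditional comment from the question, link tag shortened.
$html = "<!--[if !IE]><!-->\n"
      . "<link rel=\"stylesheet\" href=\"/css/screen.css\" />\n"
      . "<!-- <![endif]-->";

// $m[0] is the text the pattern would delete.
preg_match($pattern, $html, $m);

// What is left after the replacement.
$remaining = preg_replace($pattern, '', $html);
```

The opening "<!--[if" is skipped because of the bracket, but the "<!-->" that follows also begins with "<!--" followed by a non-bracket character (">"), so the match starts there and runs to the next "-->", which is the end of "<!-- <![endif]-->". That takes out the stylesheet link and the closing part and leaves only "<!--[if !IE]>" behind.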

If anyone can suggest a regex which would take this into account and leave that conditional in place too, then that would be perfect.

Tomalak's solution looks good, but as a newbie with no further guidance I don't know how to implement it. I would like to try it if anyone can elaborate on how to apply it.

Thanks

Ian