Hi there,

I am writing a comment stripper and trying to cover every case. The code below removes pretty much all comments, but it actually goes too far. I spent a lot of time researching and testing the regex patterns, but I don't claim that each one is the best for its job.

My problem is that I also have situations where 'PHP comments' that aren't really comments appear in standard code, or even in PHP strings, and I don't want those removed.

Example:

<?php $Var = "Blah blah //this must not comment"; // this must comment. ?>

What ends up happening is that it strips anything that looks like a comment, which is mostly fine, but it leaves certain problems:

<?php  $Var = "Blah blah  ?>

Also, a // comment on the same line as the closing ?> will cause problems, as the pattern removes the rest of the line, including the closing ?>

See the problem? So this is what I need...

  • Comment characters within '' or "" need to be ignored
  • For PHP comments on the same line that use double slashes, either only the comment itself should be removed, or the entire PHP code block.

Here are the patterns I use at the moment. Feel free to tell me if there are improvements I can make to them. :)

$CompressedData = $OriginalData;
$CompressedData = preg_replace('!/\*.*?\*/!s', '', $CompressedData);  // removes /* comments */
$CompressedData = preg_replace('!//.*?\n!', '', $CompressedData); // removes //comments
$CompressedData = preg_replace('!#.*?\n!', '', $CompressedData); // removes # comments
$CompressedData = preg_replace('/<!--(.*?)-->/', '', $CompressedData); // removes HTML comments

Any help that you can give me would be greatly appreciated! :)

+3  A: 

If you want to parse PHP, you can use token_get_all to get the tokens of a given piece of PHP code. Then you just need to iterate over the tokens, remove the comment tokens and put the rest back together.
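
A minimal sketch of the tokenizer approach described above. The function name strip_php_comments is made up for illustration, and the sketch assumes the tokenizer extension is available (it is in default PHP builds). It is not a complete minifier, only an outline of the idea:

```php
<?php
// Hypothetical helper: rebuilds the source from its tokens, skipping comments.
function strip_php_comments(string $source): string
{
    $output = '';
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Array tokens have the form [id, text, line number].
            if ($token[0] === T_COMMENT || $token[0] === T_DOC_COMMENT) {
                continue; // drop //, # and /* */ comments as well as docblocks
            }
            $output .= $token[1];
        } else {
            // Single-character tokens such as ';' or '=' arrive as plain strings.
            $output .= $token;
        }
    }
    return $output;
}

$code = '<?php $Var = "Blah blah //this must not comment"; // this must comment. ?>';
echo strip_php_comments($code), "\n";
```

Because the tokenizer already knows where strings end and where ?> closes a block, the // inside the string is kept and the closing tag survives. Text outside the PHP tags comes back as inline HTML and is passed through unchanged.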

But you would need a separate procedure for the HTML comments, preferably a real parser too (like DOMDocument provides with DOMDocument::loadHTML).
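
As a rough illustration of the DOMDocument route, the sketch below removes comment nodes with an XPath query. The helper name strip_html_comments is an assumption, and the sketch expects plain HTML as input: loadHTML wraps fragments in html/body elements and will not treat embedded <?php ... ?> blocks the way PHP does.

```php
<?php
// Hypothetical helper: removes every comment node from an HTML document.
function strip_html_comments(string $html): string
{
    $doc = new DOMDocument();
    // Collect libxml warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the node list into an array before removing nodes from the tree.
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        $comment->parentNode->removeChild($comment);
    }
    return $doc->saveHTML();
}

echo strip_html_comments('<html><body><!-- gone --><p>kept</p></body></html>');
```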

Gumbo
Although, most "HTML" parsers are actually XML parsers and won't be able to properly parse HTML that is often used with PHP, since the files themselves are rarely well formed (even if the resulting page was).
Rithiur
This is why DOMDocument has the loadHTML method, which can make sense of totally mangled HTML. DOMDocument in combination with an XPath expression that finds all comments and removes them seems to be a valid option for the HTML comments. Plus, it makes the resulting HTML XHTML-compliant.
chiborg
A: 

One way to do this in REGEX is to use one compound expression and preg_replace_callback.

I was going to post a poor example but the best place to look is at the source code to the PHP port of Dean Edwards' JS packer script - you should see the general idea.

http://joliclic.free.fr/php/javascript-packer/en/
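
To give a rough idea of the compound-expression approach (this is a sketch, not code taken from the packer linked above), the pattern below matches string literals and comments in a single pass, and the callback keeps the strings while dropping the comments. The function name strip_comments_single_pass is hypothetical, and the sketch assumes the input is PHP code only: apostrophes, # or // in surrounding HTML would be misread, and heredocs are not handled.

```php
<?php
// Hypothetical helper: one pass over the source, so whichever construct
// starts first (string or comment) is the one that gets matched.
function strip_comments_single_pass(string $source): string
{
    $doubleQuoted = '"(?:[^"\\\\]|\\\\.)*"';      // "..." with backslash escapes
    $singleQuoted = "'(?:[^'\\\\]|\\\\.)*'";      // '...' with backslash escapes
    $blockComment = '/\*.*?\*/';                  // /* ... */
    // Line comment: stops before ?>, a newline, or the end of the input.
    $lineComment  = '(?://|#)[^\r\n]*?(?=\?>|\r?\n|$)';

    // Group 1 captures string literals; comments match outside of any group.
    $pattern = "~($doubleQuoted|$singleQuoted)|$blockComment|$lineComment~s";

    $result = preg_replace_callback($pattern, function (array $m): string {
        // Keep strings verbatim, replace comments with nothing.
        return isset($m[1]) && $m[1] !== '' ? $m[0] : '';
    }, $source);

    return $result ?? $source; // fall back to the input if PCRE reports an error
}

$code = '<?php $Var = "Blah blah //this must not comment"; // this must comment. ?>';
echo strip_comments_single_pass($code), "\n";
```

For the example from the question this prints <?php $Var = "Blah blah //this must not comment"; ?> with the string and the closing tag intact.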

banks
This is just for internal compression of HTML, JS and PHP in a single script, and performance is not a concern. In fact, it's surprisingly quick, even though I know regex replaces like these are not the optimal way of doing this. I've managed to get the thing working as I want, but now I need it to remove any \n newlines, except those contained within "" or ''. Any clues? Dean's packer may not be able to help me with this particular issue. It's probably simple though... I'm a bit of a n00b at this, hehe, it's largely experimentation on my side.
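
One possible way to handle the newline case is the same single-pass trick: match string literals and newlines with one pattern, keep the strings, and replace only the newlines found outside them. This is a sketch under several assumptions: the name strip_newlines_outside_strings is made up, comments have already been stripped (otherwise a // comment would swallow the code joined onto its line), heredocs are not handled, and newlines are replaced with a space by default so that adjacent keywords do not run together.

```php
<?php
// Hypothetical helper: replaces newlines that are not inside '...' or "...".
function strip_newlines_outside_strings(string $source, string $replacement = ' '): string
{
    // Group 1 captures quoted strings (with escapes); the last branch is a newline.
    $pattern = '~("(?:[^"\\\\]|\\\\.)*"|\'(?:[^\'\\\\]|\\\\.)*\')|\r?\n~s';

    $result = preg_replace_callback($pattern, function (array $m) use ($replacement): string {
        // A string literal matched: return it untouched, newlines included.
        if (isset($m[1]) && $m[1] !== '') {
            return $m[0];
        }
        // Otherwise the match is a bare newline outside any string.
        return $replacement;
    }, $source);

    return $result ?? $source; // fall back to the input if PCRE reports an error
}

$code = "\$a = 'line1\nline2';\n\$b = 2;";
echo strip_newlines_outside_strings($code), "\n";
```

Passing an empty string as the second argument removes the newlines entirely instead of replacing them with a space.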
David
+1  A: 

You should first think carefully about whether you actually want to do this. Though what you're doing may seem simple, in the worst case it becomes an extremely complex problem to solve with just a few regular expressions. Let me illustrate just a few of the problems you would face when trying to strip both HTML and PHP comments from a file.

You can't straight out strip HTML comments, because you may have PHP inside the HTML comments, like:

<!-- HTML comment <?php echo 'Actual PHP'; ?> -->

You can't simply deal separately with the stuff inside the <?php and ?> tags either, since the ending tag ?> can be inside strings or even comments, like:

<?php /* ?> This is still a PHP comment <?php */ ?>

Let's not forget that ?> actually ends the PHP block even when it is preceded by a one-line comment. For example:

<?php // ?> This is not a PHP comment <?php ?>

Of course, like you already illustrated, there will be plenty of problems with comment indicators inside strings. Parsing out strings to ignore them isn't that simple either, since you have to remember that quotes can be escaped. Like:

<?php
$foo = ' /* // None of these start a comment ';
$bar = ' \' // Remember escaped quotes ';
$orz = " ' \" \' /* // Still not a comment ";
?>

Parsing order will also cause you headaches. You can't simply choose to parse either the one-line comments first or the multi-line comments first. They both have to be parsed at the same time (i.e. in the order they appear in the document). Otherwise you may end up with broken code. Let me illustrate:

<?php
/* // Multiline comment */
// /* Single Line comment
$omg = 'This is not in a comment */';
?>

If you parse multi line comments first, the second /* will eat up part of the string destroying the code. If you parse the single line comments first, you will end up eating the first */, which will also destroy the code.

As you can see, there are many complex scenarios you'd have to account for if you intend to solve your problem with regular expressions. The only correct solution is to use some sort of PHP parser, like token_get_all(), to tokenize the entire source code, strip the comment tokens and rebuild the file. Which, I'm afraid, isn't entirely simple either. It also won't help with HTML comments, since the HTML is left untouched. You can't use XML parsers to get the HTML comments either, because the HTML is rarely well formed when mixed with PHP.

In short, the idea of what you're doing is simple, but the actual implementation is much harder than it seems. Thus, I would recommend avoiding this unless you have a very good reason to do it.

Rithiur
All very good points, and I have already accounted for those kinds of situations with overall success. As I said in my other comment, I'm using this for internal purposes, so it doesn't have to be perfect. I've managed to solve most of my problems; the only thing in my way now is the removal of newline characters - I _DON'T_ want to remove newlines inside of strings. See, this is for my own coding style in general, so I know how I comment things and such, and I've written the regexes accordingly. In all my tests, everything is fine, for now. :) Except the 'intentional' newlines.
David