ansaurus

Question

(php) regex to remove comments but ignore occurrences within strings

Answer 1

+3 A:

If you want to parse PHP, you can use token_get_all to get the tokens of a given piece of PHP code. Then you just need to iterate over the tokens, drop the comment tokens and put the rest back together.
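A minimal sketch of that token loop, in case it helps. The function name stripPhpComments is made up for illustration; token_get_all, T_COMMENT and T_DOC_COMMENT are the real tokenizer API. It assumes the tokenizer extension is available, and it only touches PHP comments: anything outside the PHP tags comes back as inline HTML, unchanged.

```php
<?php
// Hypothetical helper: rebuilds the source from its tokens, skipping comments.
function stripPhpComments(string $source): string
{
    $out = '';
    foreach (token_get_all($source) as $token) {
        if (!is_array($token)) {
            // Single-character tokens such as ";" or "{" arrive as plain strings.
            $out .= $token;
            continue;
        }
        [$id, $text] = $token;
        if ($id === T_COMMENT || $id === T_DOC_COMMENT) {
            // Before PHP 8.0 a single-line comment token includes its trailing
            // newline, so keep the line break when it is present.
            if (substr($text, -1) === "\n") {
                $out .= "\n";
            }
            continue;
        }
        $out .= $text;
    }
    return $out;
}
```

For example, stripPhpComments(file_get_contents('page.php')) would return the file with its PHP comments removed, where page.php is a placeholder path. Comment markers inside strings are left alone because the tokenizer reports the whole string as one token.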

But you would need a separate procedure for the HTML comments, preferably a real parser too (like DOMDocument provides with DOMDocument::loadHTML).

Gumbo 2010-03-19 08:51:45

Although, most "HTML" parsers are actually XML parsers and won't be able to properly parse the HTML that is often used with PHP, since the files themselves are rarely well-formed (even if the resulting page is).

Rithiur 2010-03-19 10:35:16

This is why DOMDocument has the loadHTML method, which can make sense of totally mangled HTML. DOMDocument in combination with an XPath expression that finds all comments and removes them seems to be a valid option for the HTML comments. Plus, it makes the resulting HTML XHTML-compliant.
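Roughly, that could look like the sketch below. The function name stripHtmlComments is made up for illustration; DOMDocument::loadHTML, DOMXPath::query and the XPath node test comment() are real. It assumes the dom extension is installed and that the input is plain HTML (for example the rendered page), since loadHTML does not understand embedded PHP blocks.

```php
<?php
// Hypothetical helper: removes every HTML comment node from a document.
function stripHtmlComments(string $html): string
{
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the node list to an array first so removing nodes is safe.
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        $comment->parentNode->removeChild($comment);
    }
    return $doc->saveHTML();
}
```

Note that saveHTML returns a complete document, so a fragment passed in comes back wrapped in html and body elements.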

chiborg 2010-03-19 11:28:08

Answer 2

A:

One way to do this in REGEX is to use one compound expression and preg_replace_callback.

I was going to post a poor example but the best place to look is at the source code to the PHP port of Dean Edwards' JS packer script - you should see the general idea.

http://joliclic.free.fr/php/javascript-packer/en/
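To give the general idea without reading the packer source, here is a small sketch of one compound expression with preg_replace_callback. It is not taken from the packer; the function name stripCommentsRegex and the pattern are assumptions for illustration. The expression matches either a quoted string or a comment, whichever comes first in the text, and the callback puts strings back untouched while dropping comments. It assumes it is fed PHP or JS code only, and it ignores # comments, heredocs and a closing PHP tag inside a single-line comment.

```php
<?php
// Hypothetical helper: strips /* */ and // comments, leaving strings alone.
function stripCommentsRegex(string $code): string
{
    // Group 1: a double- or single-quoted string, allowing escaped characters.
    $string  = '"(?:[^"\\\\]|\\\\.)*"|\'(?:[^\'\\\\]|\\\\.)*\'';
    // Group 2: a block comment or a single-line comment.
    $comment = '/\*.*?\*/|//[^\r\n]*';
    $pattern = '~(' . $string . ')|(' . $comment . ')~s';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // Group 2 is only present when the comment branch matched.
        return ($m[2] ?? '') !== '' ? '' : $m[0];
    }, $code);

    if ($result === null) {
        throw new RuntimeException('preg_replace_callback failed: ' . preg_last_error());
    }
    return $result;
}
```

Because both alternatives live in a single expression, the text is scanned once from left to right, so a comment marker inside a string is consumed as part of the string before the comment branch ever sees it.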

banks 2010-03-19 10:05:51

This is just for internal compression of HTML, JS and PHP in a single script, and performance is not a concern. In fact, it's surprisingly quick, even though I know regex replaces like these are not the optimal way of doing this. I've managed to get the thing working as I want, but now I need it to remove any \n newlines, except those contained within "" or ''. Any clues? Dean's packer may not be able to help me with this particular issue. It's probably simple though... I'm a bit of a n00b at this, hehe, it's largely experimentation on my side.

David 2010-03-19 11:01:12

Answer 3

+1 A:

You should first think carefully about whether you actually want to do this. Though what you're doing may seem simple, in the worst case it becomes an extremely complex problem (to solve with just a few regular expressions). Let me illustrate just a few of the problems you would face when trying to strip both HTML and PHP comments from a file.

You can't simply strip HTML comments outright, because you may have PHP inside the HTML comments, like:

<!-- HTML comment <?php echo 'Actual PHP'; ?> -->

You can't simply deal with the stuff inside the <?php and ?> tags separately either, since the ending tag ?> can appear inside strings or even comments, like:

<?php /* ?> This is still a PHP comment <?php */ ?>

Don't forget that ?> actually ends the PHP block when it follows a single-line comment marker on the same line. For example:

<?php // ?> This is not a PHP comment <?php ?>

Of course, as you already illustrated, there will be plenty of problems with comment indicators inside strings. Skipping over strings isn't that simple either, since you have to remember that quotes can be escaped. Like:

<?php
$foo = ' /* // None of these start a comment ';
$bar = ' \' // Remember escaped quotes ';
$orz = " ' \" \' /* // Still not a comment ";
?>

Parsing order will also cause you headaches. You can't simply choose to parse either the single-line comments first or the multi-line comments first. They both have to be parsed at the same time (i.e. in the order they appear in the document). Otherwise you may end up with broken code. Let me illustrate:

<?php
/* // Multiline comment */
// /* Single Line comment
$omg = 'This is not in a comment */';
?>

If you parse multi-line comments first, the second /* will eat up part of the string, destroying the code. If you parse the single-line comments first, you will end up eating the first */, which will also destroy the code.
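Both failure modes are easy to reproduce. The snippet below is only a demonstration with two deliberately naive patterns (my own, not anything from the question) applied to the sample above with the PHP tags left out:

```php
<?php
// The sample from above, without the PHP open and close tags.
$src = "/* // Multiline comment */\n"
     . "// /* Single Line comment\n"
     . "\$omg = 'This is not in a comment */';\n";

// Order A: strip multi-line comments first.
// The second pass of the pattern starts at the "/*" on line 2 and runs up to
// the "*/" inside the string, so the assignment is swallowed.
$blockFirst = preg_replace('~/\*.*?\*/~s', '', $src);
// $blockFirst is now "\n// ';\n"

// Order B: strip single-line comments first.
// The "//" inside the first comment removes its closing "*/".
$lineFirst = preg_replace('~//[^\n]*~', '', $src);
// $lineFirst is now "/* \n\n\$omg = 'This is not in a comment */';\n"

// A later multi-line pass then runs from the orphaned "/*" into the string.
$lineThenBlock = preg_replace('~/\*.*?\*/~s', '', $lineFirst);
// $lineThenBlock is now "';\n"
```

In both orders the $omg assignment is lost, which is why the two comment styles (and strings) have to be recognised in one pass.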

As you can see, there are many complex scenarios you'd have to account for if you intend to solve your problem with regular expressions. The only correct solution is to use some sort of PHP parser, like token_get_all(), to tokenize the entire source code, strip the comment tokens and rebuild the file. Which, I'm afraid, isn't entirely simple either. It also won't help with HTML comments, since the HTML is left untouched. You can't use XML parsers to get the HTML comments either, because HTML mixed with PHP is rarely well-formed.

In short, the idea of what you're doing is simple, but the actual implementation is much harder than it seems. Thus, I would recommend avoiding this unless you have a very good reason to do it.

Rithiur 2010-03-19 10:58:15

All very good points, and I have already accounted for those kinds of situations with overall success. As I said in the comment below, I'm using this for internal purposes, so it doesn't have to be perfect. I've managed to solve most of my problems; the only thing in my way now is the removal of newline characters - I _DON'T_ want to remove newlines inside of strings. See, this is for my own coding style in general, so I know how I comment things and such, and I've written the regexes accordingly. In all my tests, everything is fine, for now. :) Except the 'intentional' newlines.

David 2010-03-19 11:29:13
