$pee = preg_replace( '|<p>|', "$1<p>", $pee );

This regular expression is from the WordPress source code (formatting.php, wpautop function). I'm not sure what it does; can anyone help?

Actually, I'm trying to port this function to Python. If anyone knows of an existing port, that would be much better, as I'm really bad with regex.

+2  A: 

...?

Actually, it looks like this takes the first <p> tag and prepends the previous regular expression's first match to it (since there's no match in this one).

However, this behavior seems bad, to say the least, as there's no guarantee that preg_* functions won't clobber $1 with their own values.

Edit: Judging from Jay's comment, this regex actually does nothing.

R. Bemrose
I don't think preg_replace will carry the backreference for $1 forward into the next invocation of preg_replace(). I tried a quick test and it doesn't seem to work that way. You could still be right, but if so it certainly would be terrible practice!
Jay
I didn't test it at work, as we don't have PHP installed here... I suppose I could have tested it remotely on my own web server.
R. Bemrose
+3  A: 

WordPress really calls a variable "pee"?

I'm not sure what the $1 stands for (there are no parentheses in the first parameter?), so I don't think it actually does anything, but I could be wrong.

I.devries
They do indeed. In fact, there's a comment in the code: // don't pee all over a tag
Jay
A: 

It replaces the match from the pattern

"|<p>|"

by the string

"$1<p>"

The | in the pattern causes the regex engine to match either the part on the left side or the part on the right side.

I do not get why it's used that way because usually it's for something like "ta(b|p)e"...

For the $1, I guess the variable $1 is set in the PHP code and is substituted during the preg_replace, so if $1 = "test"; the replacement will change

"<p>"

to

"test<p>"

But I am not sure about the $1.

Daok
$1 would be an illegal variable name, so it can't be set in the code. It has to be a backreference from the regular expression in preg_replace(), except there aren't any groups in the regex, so it should just be an empty string.
Jay
+3  A: 

The preg_replace() function - somewhat confusingly - allows you to use other delimiters besides the standard "/" for regular expressions, so

"|<p>|"

would be a regular expression just matching

"<p>"

in the text. However, I'm not clear on what the replacement parameter of

"$1<p>"

would be doing, since there's no grouping to map to $1. It would seem like as given, this is just replacing a paragraph tag with an empty string followed by a paragraph tag, and in effect doing nothing.
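For the Python port the question asks about, a minimal sketch of this one line might look like the following. It assumes the reading given above: the pattern is just <p> once the | delimiters are stripped, and PHP substitutes an empty string for the unmatched $1. The function name port_noop_line is made up for illustration. Note that Python's re.sub rejects a reference to a group that does not exist, so the $1 is dropped rather than translated to \1.

```python
import re

def port_noop_line(pee):
    # PHP: $pee = preg_replace( '|<p>|', "$1<p>", $pee );
    # The | characters are only delimiters, so the pattern is just <p>.
    # There is no group 1, and PHP substitutes an empty string for $1,
    # so the replacement is effectively "<p>".
    return re.sub(r'<p>', '<p>', pee)

sample = "<p>one</p>\n<p>two</p>"
result = port_noop_line(sample)  # same text as sample

# Unlike PHP, Python refuses a reference to a group that does not exist.
try:
    re.sub(r'<p>', r'\1<p>', sample)
    raised = False
except re.error:
    raised = True
```

Since the call leaves the text unchanged, a port could arguably omit the line altogether.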

Anyone with more in-depth knowledge of PHP quirks have a better analysis?

Jay
This isn't really unique to preg_replace - '/' isn't a standard, it's just popular convention. I actually can't think of a single regex implementation that forces you to use /
Daniel Papasian
Heh. I've never seen anything other than / used in Perl, Python, Java or any other language for that matter. My understanding was that this was done in PHP specifically to avoid the ugliness of regexes that might include closing HTML tags with / in them.
Jay
@Jay: preg functions are processed with the PCRE library... Perl Compatible Regular Expressions. Not surprisingly, Perl can use characters other than / so preg can too.
R. Bemrose
I would say it's pretty standard practice to choose delimiters that are appropriate to what you're trying to find. If you're working with paths containing lots of forward slashes, slashes as regex delimiters are just obtuse.
reefnet_alex
A: 

I highly recommend the amazing RegexBuddy

daniel
That probably won't help with this particular question, since the code in question isn't a standard regex issue. The delimiters are non-standard and the backreference doesn't actually come from within the pattern, so RegexBuddy probably won't be able to decipher this either.
Jay
Although I agree with you, RegexBuddy has options that show the differences between regex implementations in several languages, which might be handy for him, since he is trying to port it from PHP to Python.
daniel
RegexBuddy will indeed show that this search-and-replace does nothing. The non-standard delimiters are no problem. Just select "Paste from PHP preg string" in the Paste menu, and RegexBuddy will figure it out. The $1 backreference is simply replaced with nothing, which RegexBuddy emulates too.
Jan Goyvaerts
+2  A: 

The pipe symbols "|" in this case do not have the default meaning of "match this or that" but are used as alternative delimiters for the pattern instead of the more common slashes "/". This may make sense if you want to match "/" without having to escape those appearances (e.g. "/(.*)\/(.*)\//" is not as readable as "#/(.*)/(.*)/#"). It seems quite counterproductive to use "|" instead, though, since it is just another reserved character for patterns.
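For the Python port, the delimiter question goes away, because Python patterns are plain strings without delimiters. A small sketch, with a made-up sample path and word list:

```python
import re

# Python patterns are plain strings with no delimiters,
# so "/" inside the pattern needs no escaping.
path_pattern = re.compile(r'/(.*)/(.*)/')
m = path_pattern.match('/usr/local/')
parts = m.groups()  # ('usr', 'local')

# A "|" inside a Python pattern is always alternation, never a delimiter.
alt = re.findall(r'ta(?:b|p)e', 'tabe tape tale')  # ['tabe', 'tape']
```

When porting, the outer | characters of the PHP pattern are simply dropped.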

Normally $1 in the replacement pattern should match the first group denoted by parentheses. E.g if you've got a pattern like

"(.*)<p>"

$0 would contain the whole match and $1 the part before the <p>.
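In Python terms, the same idea can be sketched as follows. The sample text is made up; group(0) plays the role of $0 and group(1) the role of $1, and in a replacement string Python writes \1 where PHP writes $1.

```python
import re

text = "some text<p>"

m = re.search(r'(.*)<p>', text)
whole = m.group(0)   # like PHP's $0: "some text<p>"
before = m.group(1)  # like PHP's $1: "some text"

# Move the captured part behind the tag using the \1 reference.
swapped = re.sub(r'(.*)<p>', r'<p>\1', text)  # "<p>some text"
```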

As the given regex does not declare any groups and $1 is not a valid name for a variable (in PHP4) defined elsewhere, this call seems to replace any occurrences of <p> with <p>?

To be honest, now I'm also quite confused. Just a guess: is another pattern-matching method (preg_match and the like) called before the given line, so that the "$1" is "leaked" from there?

Argelbargel
I tested that theory with a sample call to preg_replace and I wasn't able to get $1 to be referenced from the previous call. So it doesn't look like that's the case either, unless it's a quirk of particular PHP versions?
Jay
A: 

I don't have very much experience with regex and don't have a regex testing tool on me at the moment, but after doing some searching and looking at other WordPress source code and comments: is it possible that this code removes duplicate paragraph tags and replaces them with a single set of tags?

Dalin Seivewright
Thought so, too, but preg_replace matches ANY p tag occurring in the subject and replaces it with the replacement pattern (so foo[p][p]bar stays foo[p][p]bar).
Argelbargel
A: 

I believe that line does nothing.

For what it's worth, this is the previous line, in which $1 is set:

$pee = preg_replace('!<p>([^<]+)\s*?(</(?:div|address|form)[^>]*>)!', "<p>$1</p>$2", $pee);

However, I don't think that's worth anything. In my testing, $1 does not maintain a value from one preg_replace to the next, even if the next doesn't set its own value for $1. Remember that PHP variable names cannot begin with a number (see: http://php.net/language.variables ), so $1 is not a PHP variable. It only means something within a single preg_replace, and in this case the rules of preg_replace suggest it doesn't mean anything.
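For the port, the previous line quoted above translates fairly directly. This is only a sketch: the function name port_previous_line and the sample strings are assumptions, and the pattern is copied from the PHP line with the ! delimiters removed and $1/$2 written as \1/\2.

```python
import re

def port_previous_line(pee):
    # PHP: preg_replace('!<p>([^<]+)\s*?(</(?:div|address|form)[^>]*>)!',
    #                   "<p>$1</p>$2", $pee)
    # Closes a <p> that runs straight into </div>, </address> or </form>.
    return re.sub(
        r'<p>([^<]+)\s*?(</(?:div|address|form)[^>]*>)',
        r'<p>\1</p>\2',
        pee,
    )

fixed = port_previous_line("<p>Hello</div>")        # "<p>Hello</p></div>"
untouched = port_previous_line("<p>Hello</p>")      # no block tag, unchanged
```

The groups \1 and \2 exist only inside this one re.sub call, which matches the observation that $1 does not carry over to the next preg_replace.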

That said, autop being such a widely-used function makes me doubt my own conclusion that this line is doing nothing. So I look forward to someone correcting me.

Scott Reynen
Same conclusion I came to, after testing it on my webserver. Hopefully someone can come along and confirm or deny ;)
Jay
PHP's preg functions do not set magic variables like Perl and Ruby do. $1 does not exist as a variable in PHP.
Jan Goyvaerts