These are the regular expressions used for "shortcodes" in WordPress (one for the whole tag, the other for the attributes).

return '(.?)\[('.$tagregexp.')\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)';
$pattern = '/(\w+)\s*=\s*"([^"]*)"(?:\s|$)|(\w+)\s*=\s*\'([^\']*)\'(?:\s|$)|(\w+)\s*=\s*([^\s\'"]+)(?:\s|$)|"([^"]*)"(?:\s|$)|(\S+)(?:\s|$)/';

It parses stuff like

[foo bar="baz"]content[/foo]

or

[foo /]

In the WordPress Trac they say it's a bit flawed, but my main problem is that it doesn't support shortcodes inside the attributes, like in

[foo bar="[baz /]"]content[/foo]

because the regex stops the main shortcode at the first appearance of a closing bracket, so in the example it renders

[foo bar="[baz /]

and

"]content[/foo]

shows as it is.

Is there any way to change the regex so that it skips over any [ ... ] pair, together with its content, when that pair occurs inside the opening tag or a self-closing tag?
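To make the failure concrete, here is a minimal standalone sketch that runs the tag pattern quoted above against the nested example. The tag list `foo|baz` is an assumption standing in for `$tagregexp`, and this only inspects the capture groups; it does not go through WordPress's own callback code.

```php
<?php
// Original tag pattern; "foo|baz" is a hypothetical stand-in for $tagregexp.
$pattern = '/(.?)\[(foo|baz)\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)/s';
$input   = '[foo bar="[baz /]"]content[/foo]';

preg_match($pattern, $input, $m);

// The lazy (.*?) stops at the first "]" it can reach, which is the one
// that closes the inner [baz /].
echo 'attributes: ', var_export($m[3], true), "\n"; // ' bar="[baz '
echo 'slash:      ', var_export($m[4], true), "\n"; // '/'
echo 'content:    ', var_export($m[5], true), "\n"; // '"]content'
```

So the attribute string is cut off at `[baz ` and the leftover `"]` ends up in the content group.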

A: 

What is your goal? Even if WordPress’ regex were better, the shortcode would not be executed.

toscho
A: 
return '(.?)\[('.$tagregexp.')\b((?:"[^"]*"|.)*?)(?:/)?\](?:(.+?)\[\/\2\])?(.?)';

is a variation on the first regex where the bit that matches the attributes has been changed to capture strings completely without regard to what's in them:

(?:"[^"]*"|.)*?

instead of

.*?

Note that it doesn't handle strings with escaped quote characters in them (yet - can be done, but is it necessary?). I haven't changed anything else because I don't know the syntax for WordPress shortcodes.

But it looks like it could have been cleaned up a little by removing unnecessary backslashes and parentheses:

return '(.?)\[(foo)\b((?:"[^"]*"|.)*?)/?\](?:(.+?)\[/\2\])?(.?)';

Perhaps further improvements are warranted. I'm a bit worried about the imprecise dot in the above snippet, and I'd rather use (?:"[^"]*"|[^/\]])* instead of (?:"[^"]*"|.)*?, but I don't know whether that would break something else. Also, I don't know what the leading and trailing (.?) are good for. They don't match anything in your example so I don't know their purpose.
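A quick standalone check of the cleaned-up pattern against the example from the question, as a sketch only. The `~` delimiter is my choice (the pattern contains unescaped slashes), and `foo` is hard-coded as in the snippet above.

```php
<?php
// Cleaned-up variant from above; "~" is used as delimiter because the
// pattern contains unescaped "/" characters.
$pattern = '~(.?)\[(foo)\b((?:"[^"]*"|.)*?)/?\](?:(.+?)\[/\2\])?(.?)~s';
$input   = '[foo bar="[baz /]"]content[/foo]';

preg_match($pattern, $input, $m);

// The quoted string is consumed as one unit, so the "]" inside it is skipped.
echo 'attributes: ', var_export($m[3], true), "\n"; // ' bar="[baz /]"'
echo 'content:    ', var_export($m[4], true), "\n"; // 'content'
```

One thing to watch: this variant no longer captures the slash in its own group, so the content lands in group 4 rather than group 5 as in the original pattern. Code that reads the groups by number would need adjusting.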

Tim Pietzcker
A: 

Do you want a drop-in replacement for that regex? This one allows attribute values to contain things that look like tags, as in your example:

'(.?)\[(\w+)\b((?:[^"\'\[\]]++|(?:"[^"]*+")|(?:\'[^\']*+\'))*+)\](?:(?<=(\/)\])|([^\[\]]*+)\[\/\2\])(.?)'

Or, in more readable form:

/(.?)              # could be [
 \[(\w+)\b         # tag name
 ((?:[^"'\[\]]++   # attributes
    |(?:"[^"]*+")
    |(?:'[^']*+')
  )*+
 )\]
 (?:(?<=(\/)\])   # '/' if self-closing
   |([^\[\]]*+)   # ...or content
    \[\/\2\]      # ...and closing tag
 )(.?)            # could be ]
/

As I understand it, $tagregexp in the original is an alternation of all the tag names that have been defined; I substituted \w+ for readability. Everything the original regex captures, this one does too, and in the same groups. The only difference is that the / in a self-closing tag is captured in group #3 along with the attributes as well as in its own group (#4).
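The group layout described above can be checked with a short standalone script. This is only a sketch: it uses the `\w+` version of the pattern exactly as written, not a real `$tagregexp`, and the three inputs are made up for illustration.

```php
<?php
// One-line form of the pattern above, wrapped in "/" delimiters.
$pattern = '/(.?)\[(\w+)\b((?:[^"\'\[\]]++|(?:"[^"]*+")|(?:\'[^\']*+\'))*+)\](?:(?<=(\/)\])|([^\[\]]*+)\[\/\2\])(.?)/';

// Enclosing tag with a bracketed attribute value.
preg_match($pattern, '[foo bar="[baz /]"]content[/foo]', $enclosing);
echo var_export($enclosing[3], true), "\n"; // ' bar="[baz /]"'
echo var_export($enclosing[5], true), "\n"; // 'content'

// Self-closing tag: the slash shows up in group 3 and in group 4.
preg_match($pattern, '[foo /]', $selfClosing);
echo var_export($selfClosing[3], true), "\n"; // ' /'
echo var_export($selfClosing[4], true), "\n"; // '/'

// A tag with neither a closing tag nor a trailing slash.
$bare = preg_match($pattern, '[foo bar="baz"]');
echo $bare, "\n"; // 0
```

Note the last case: as written, the pattern requires either the trailing slash or a matching closing tag, so a bare `[foo bar="baz"]` does not match in this test, whereas the original pattern treats the content part as optional.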

I don't think the other regex needs to be changed unless you want to add full support for tags embedded in attribute values. That would also mean allowing for escaped quotes in this one, and I don't know how you would want to do that. Doubling them would be my guess; that's how Textpattern does it, and WordPress is supposedly based on that.

This question is a good example of why apps like WordPress shouldn't be implemented with regexes. The only way to add or change functionality is by making the regexes bigger and uglier and even harder to maintain.

Alan Moore
I tried replacing the whole regex and the shortcode shows as text. I tried with the 3rd group only and normal shortcode runs, but when there's a shortcode inside a shortcode, only the one(s) inside run and the surrounding one shows as text.
peroyomas
A: 

I found a way to fix it: First, change the shortcode regex from:

(.?)\[('.$tagregexp.')\b(.*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)

To:

(.?)\[('.$tagregexp.')\b((?:[^\[\]]|(?R)|.)*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)

Then change the priority of the do_shortcode filter to avoid a conflict with wptexturize, the function that stylizes the quotes and messes up this fix. It doesn't have problems with wpautop, because I think that is more or less handled by another recently added function.

Before:

add_filter('the_content', 'do_shortcode', 11); // AFTER wpautop() 

After:

add_filter('the_content', 'do_shortcode', 9);  

I submitted this to the Trac, where it is on some kind of permanent hiatus. In the meantime I'm trying to figure out whether I can make a plugin that applies my fix without changing the core files. Overriding the filter priority is easy, but I have no idea how to override the regex.
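For anyone who wants to try the recursive pattern outside WordPress first, here is a minimal standalone sketch. The tag list `foo|baz` is an assumption standing in for `$tagregexp`, and the script only looks at the capture groups, so it says nothing about the wptexturize interaction.

```php
<?php
// Recursive variant from above; "foo|baz" is a hypothetical $tagregexp.
// (?R) re-enters the whole pattern, so a nested [baz /] inside the
// attributes is consumed as one unit.
$pattern = '/(.?)\[(foo|baz)\b((?:[^\[\]]|(?R)|.)*?)(?:(\/))?\](?:(.+?)\[\/\2\])?(.?)/s';
$input   = '[foo bar="[baz /]"]content[/foo]';

preg_match($pattern, $input, $m);

echo 'tag:        ', var_export($m[2], true), "\n"; // 'foo'
echo 'attributes: ', var_export($m[3], true), "\n"; // ' bar="[baz /]"'
echo 'content:    ', var_export($m[5], true), "\n"; // 'content'
```

The group numbers stay the same as in the original pattern, which is why the rest of the shortcode code does not need to change.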

peroyomas
A: 

This would be nice to fix! I do not have sufficient rep to comment, so I am leaving the following related wordpress trac link, maybe it is the same as the one you meant: http://core.trac.wordpress.org/ticket/14481

I would hope that any fix would allow shortcode syntax like

[shortcode att1="val]ue"]content[/shortcode]

since in 3.0.1 the $content is mis-parsed as ue"]content instead of just content

Update: After spending time learning about regices (regexes?) I made it possible to allow ] and Pascal-style escaped quotes (e.g. arg='that''s [so] great') in these arguments with 2 changes. First, change the (.*?) group in the first regex (get_shortcode_regex) to

((?:[^'"\]]|'[^']*'|"[^"]*")*)

(NB: make sure you escape everything properly in your PHP code.) Then, in shortcode_parse_atts (the function containing the second regex), make the following changes (again, change ' to \' if you single-quote $pattern like in the original code):

in $pattern change "([^"]*)" to "((?:[^"]|"")*)"
in $pattern change '([^']*)' to '((?:[^']|'')*)'
$atts[strtolower($m[1])] = preg_replace('_""_', '"', stripcslashes($m[2]));
$atts[strtolower($m[3])] = preg_replace("_''_", "'", stripcslashes($m[4]));

NB again: changes to pattern may rely on greedy nature of matching so if that option's ever changed, the changed bits of $pattern might have to be terminated with something like (?!"), etc
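Here is a small standalone sketch of the doubled-quote idea, using only the two changed pieces of $pattern. The function name `parse_doubled_quote_attr` and the leading `^` anchor are my own additions for illustration; the real shortcode_parse_atts uses one combined pattern and a loop.

```php
<?php
// Hypothetical helper: parses a single name="value" or name='value'
// attribute where a doubled quote inside the value stands for one quote.
function parse_doubled_quote_attr(string $text): ?array
{
    $double = '/^(\w+)\s*=\s*"((?:[^"]|"")*)"(?:\s|$)/';
    $single = '/^(\w+)\s*=\s*\'((?:[^\']|\'\')*)\'(?:\s|$)/';

    if (preg_match($double, $text, $m)) {
        // Collapse "" back to " after matching.
        return [strtolower($m[1]), preg_replace('_""_', '"', stripcslashes($m[2]))];
    }
    if (preg_match($single, $text, $m)) {
        // Collapse '' back to ' after matching.
        return [strtolower($m[1]), preg_replace("_''_", "'", stripcslashes($m[2]))];
    }
    return null; // not a quoted attribute
}

var_export(parse_doubled_quote_attr("arg='that''s [so] great'"));
echo "\n";
var_export(parse_doubled_quote_attr('arg="say ""hi"" [now]"'));
echo "\n";
```

The first call yields the value `that's [so] great` and the second `say "hi" [now]`, both under the key `arg`.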

daveagp