Hey all,

I'm a total regexp noob. I'm working with WordPress and I'm desperately trying to deal with WordPress's wpautop, which I hate and love (more hate!). Anyway, I'm trying to remove the <p> tags that it wraps around certain commands.

Here's what I get:

<p>
[hide]
<img.../>
[/hide]
</p>

or

<p>
[imagelist]
<img .../>
<img .../>
[/imagelist]
</p>

Here's what I'd like:

[hide]
<img.../>
[/hide]

or

[imagelist]
<img .../>
<img .../>
[/imagelist]

I've tried:

preg_replace('/<p[^>]*>(\[[^>]*\])<\/p[^>]*>/', '$1', $content); // No luck!

EDIT: When I run the regexp, the content is still just a variable containing text. It has not been parsed as HTML yet. I know this is possible because I already did it to get rid of the p tags around an image tag. So I just need a regexp to handle text that will be parsed as HTML at some point in the future. Here's a similar question

Thanks! Matt Mueller

+5  A: 

You can't use regular expressions to parse HTML, because HTML is, by definition, a non-regular language. Period, end of discussion.

Paul Tomblin
Thanks for the response. I think I get what you are saying, though that's not what I meant. I edited it to clarify.
Matt
@Matt, just because it hasn't been put into <html> tags and displayed on a browser doesn't mean it isn't HTML and it isn't non-regular. You might think you've handled your test cases, but trust me, somebody will throw up cases that your regexp will break on.
Paul Tomblin
+3  A: 

The language of matching HTML tags is context-free, not regular. This means regular expressions are probably not the right tool to use here. Context-free languages require parsers rather than regular expressions. So, you can either remove ALL <p> and </p> tags with a regular expression, or you can use an HTML parser to remove matching tags from certain parts of your document.
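As a rough illustration of the first option (removing ALL <p> and </p> tags with a regular expression), here is a minimal sketch. The function name strip_all_paragraph_tags is hypothetical, and the pattern is an assumption on my part rather than anything from the question; it requires whitespace or ">" right after the "p" so that tags such as <pre> are left alone.

```php
<?php
// Hypothetical helper: strips every <p ...> and </p> tag, keeping the text between them.
function strip_all_paragraph_tags(string $content): string
{
    // </?p      opening or closing p tag
    // (?:\s[^>]*)?  optional attributes, which must start with whitespace
    // i flag    also match <P> and </P>
    $pattern = '%</?p(?:\s[^>]*)?>%i';

    // preg_replace returns null on error; fall back to the unmodified input in that case.
    return preg_replace($pattern, '', $content) ?? $content;
}

$content = "<p class=\"x\">\n[hide]\n<img src=\"a.png\"/>\n[/hide]\n</p>";
echo strip_all_paragraph_tags($content);
```

Note that this removes every paragraph tag in the content, not only the ones around the bracketed commands, so it is probably too blunt for the case in the question.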

danben
It's still text before it's turned into HTML. Please see the edited post.
Matt
+1  A: 

Try this regex:

'%<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>%s'

Explanation:

\[([^\[\]]+)\] matches the opening bbcode tag and captures the tag name in group #2.

\[/\2\] matches a corresponding closing tag.

.*? matches anything, reluctantly. Thanks to the s flag at the end, it also matches newlines. The effect of the reluctant .*? is that it stops matching the first time it finds a closing bbcode tag with the right name. If tags are nested (within tags with the same name) or improperly balanced, it won't work correctly. I wouldn't expect that to be a problem, but I have no experience with WordPress, so YMMV.
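For completeness, here is a sketch of how the pattern above might be plugged into preg_replace. The function name unwrap_bracket_tags and the sample content are made up for illustration; only the pattern itself comes from this answer. Group #1 holds the whole bracketed block, so replacing the match with '$1' drops the surrounding <p> and </p>.

```php
<?php
// Hypothetical helper: removes <p>...</p> wrappers around [tag]...[/tag] blocks.
function unwrap_bracket_tags(string $content): string
{
    // Group 1: the whole [tag]...[/tag] block. Group 2: the tag name, reused by \2.
    $pattern = '%<p[^>]*>\s*(\[([^\[\]]+)\].*?\[/\2\])\s*</p>%s';

    // preg_replace returns null on error; fall back to the unmodified input in that case.
    return preg_replace($pattern, '$1', $content) ?? $content;
}

$content = "<p>\n[hide]\n<img src=\"a.png\"/>\n[/hide]\n</p>";
echo unwrap_bracket_tags($content);
```

Two caveats with this sketch: an opening tag that carries attributes (something like [hide foo="bar"]) would not be matched, because the whole bracket content is captured as the tag name, and <p[^>]*> would also accept other tags that start with "p", such as <pre>.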

Alan Moore
Thanks a lot for your help!
Matt