First things first. I know how to parse XML/HTML with SimpleXML, and I know all the arguments against using regex to parse it. This question is for the sake of knowledge.

What needs to happen

In a block of text let's say we have the following line of text:

The query you need to use is <code>SELECT `post_name` FROM table WHERE id= $id</code> where `$id` is the `user_ID` we got earlier.

How do you match the following:

 `$id`
 `user_ID`

without also matching

`post_name`?

Requirements

This needs to be a regex-only solution. I know how to use things like preg_replace_callback etc. to remove <code> blocks from the string first, but I'm looking for a regex-only solution. Also, it needs to be able to handle possible attributes, like <code lang="php">.

The regex needs to match pairs of backticks that are not between <code> and </code>. The matches may not contain either <code> or </code>; this is to handle lone backticks in other contexts.

The content in the backticks will never span multiple lines.

Reasoning

I'm working on a personal project where this was a possible edge case. This is not a Markdown-type project where it is possible to change the order of the calls. The <code> tags are in the source text and not going anywhere.

Also, part of the reason I don't want "use SimpleXML" answers is that the backticks are not inside actual <code> blocks. It is just a handy way to explain the problem, and the solution for <code> blocks will work with slight changes.

+3  A: 

I don't think regular expressions are a good tool for this, but it can be done if you assume that the code tags aren't nested:

`(?:(?!</?code>)[^`])*`(?!(?:(?!<code>).)*</code>)

This means:

`(?:(?!</?code>)[^`])*`       : Match something in backticks unless it
                                contains <code> or </code> or a backtick...
(?!(?:(?!<code>).)*</code>)   : unless it is followed by a </code>
                                without a <code> first.

See the regular expression in action at rubular.
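
As a quick check of the pattern above, here is a minimal sketch using Python's `re` module (an assumption made for illustration; the thread targets PCRE, but the lookahead syntax used here is the same). It runs the answer's pattern on the sample line from the question. The second pattern, `ATTR_PATTERN`, is a hypothetical variant that is not part of the answer: it swaps `<code>` for `<code\b[^>]*>` so that opening tags with attributes such as `<code lang="php">` are recognised too. The text in `sample_attr` is made up for the example.

```python
import re

# The pattern from the answer, unchanged.
PATTERN = re.compile(r"`(?:(?!</?code>)[^`])*`(?!(?:(?!<code>).)*</code>)")

# Hypothetical variant (not from the answer): the opening tag may carry attributes.
ATTR_PATTERN = re.compile(
    r"`(?:(?!</?code\b[^>]*>)[^`])*`(?!(?:(?!<code\b[^>]*>).)*</code>)"
)

# The sample line from the question.
sample = (
    "The query you need to use is "
    "<code>SELECT `post_name` FROM table WHERE id= $id</code> "
    "where `$id` is the `user_ID` we got earlier."
)

# Made-up text: a backtick pair precedes a block whose opening tag has an attribute.
sample_attr = (
    "Take `$id` and run "
    '<code lang="php">SELECT `post_name` FROM table</code> '
    "to look up the `user_ID`."
)

plain_matches = PATTERN.findall(sample)
attr_matches = ATTR_PATTERN.findall(sample_attr)

print(plain_matches)                  # backtick pairs outside the <code> block
print(attr_matches)                   # same, with the attribute-tolerant variant
print(PATTERN.findall(sample_attr))   # the unchanged pattern on the attribute case
```

On `sample_attr` the unchanged pattern appears to skip `$id`, because it does not see `<code lang="php">` as an opening tag and so treats the later `</code>` as closing a block that began before `$id`. The variant avoids that, still under the answer's assumption that code tags are not nested.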

Mark Byers
Perfect. The only change I would make is to replace `.` with `[\s\S]` to handle multi-line `<code>` blocks. I know they aren't the best tool, but I was curious to see how it would be done. No worries, it won't be popping up in a project. =)
Aaron Harun
Instead of using `[\s\S]`, you could simply set the `s` modifier (`PCRE_DOTALL`).
nikic
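
The two suggestions in the comments above can be compared with a small sketch. It uses Python's `re` module for illustration (an assumption; `re.DOTALL` plays the role of the `s` modifier in PCRE), and the multi-line sample text is made up for the example.

```python
import re

# First half of the answer's pattern: a backtick pair with no code tag inside.
BODY = r"`(?:(?!</?code>)[^`])*`"

# As posted: `.` does not cross newlines, so the lookahead stops at the line end.
plain = re.compile(BODY + r"(?!(?:(?!<code>).)*</code>)")

# Variant 1: replace `.` with [\s\S], which matches any character including newlines.
any_char = re.compile(BODY + r"(?!(?:(?!<code>)[\s\S])*</code>)")

# Variant 2: keep `.` and switch on DOTALL instead.
dotall = re.compile(BODY + r"(?!(?:(?!<code>).)*</code>)", re.DOTALL)

# Made-up sample with a <code> block spanning several lines.
text = (
    "<code>\n"
    "SELECT `post_name`\n"
    "FROM table\n"
    "</code>\n"
    "where `$id` is the `user_ID`."
)

print(plain.findall(text))     # also picks up `post_name` inside the block
print(any_char.findall(text))  # only the pairs outside the block
print(dotall.findall(text))    # same result as the [\s\S] variant
```

Both variants give the same matches here; the flag simply keeps the pattern text closer to the one in the answer.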