Following on from a previous question in which I asked:

How can I use a regular expression to match text that is between two strings, where those two strings are themselves enclosed by two other strings, with any amount of text between the inner and outer enclosing strings?

I got this answer:

/outer-start.*?inner-start(.*?)inner-end.*?outer-end/

I would now like to know how to exclude certain strings from the text between the outer enclosing strings and the inner enclosing strings.

For example, if I have this text:

outer-start some text inner-start text-that-i-want inner-end some more text outer-end

I would like 'some text' and 'some more text' not to contain the word 'unwanted'.

In other words, this is OK:

outer-start some wanted text inner-start text-that-i-want inner-end some more wanted text outer-end

But this is not OK:

outer-start some unwanted text inner-start text-that-i-want inner-end some more unwanted text outer-end

Put another way, the text matched between the outer and inner delimiters by the expression above should not contain the word 'unwanted'.

Is this easy to match using regexes?

A: 

Try replacing the last .*? with: (?!(.*unwanted text.*))

Did it work?

Oren
If you're unsure (and even if you think you're sure), you should test your pattern locally (or on a site like http://codepad.org/), which is why regex questions need good examples (both passing and failing).
Roger Pate
+1  A: 

You can replace .*? with

 ([^u]|u[^n]|un[^w]|unw[^a]|unwa[^n]|unwan[^t]|unwant[^e]|unwante[^d])*?

This is a solution in "pure" regex; the language you are using might allow you to use some more elegant construct.

Heinzi
+2  A: 

Replace the first and last (but not the middle) .*? with (?:(?!unwanted).)*?. (Where (?:...) is a non-capturing group, and (?!...) is a negative lookahead.)

However, this quickly runs into corner cases and caveats in any real (as opposed to example) use. If you ask about what you're really doing, with real examples (even simplified ones) instead of made-up ones, you'll likely get better answers.
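
For anyone who wants to try the substitution, here is a minimal sketch in Python, whose re module supports the same (?:...) and (?!...) syntax. The names GAP, PATTERN and extract are made up for illustration, and the two sample strings are the ones from the question.

```python
import re

# Guarded gap: consume one character at a time, but only at positions
# where the word "unwanted" does not start.
GAP = r"(?:(?!unwanted).)*?"

# First and last .*? replaced by GAP; the middle (.*?) is left alone.
PATTERN = re.compile(
    r"outer-start" + GAP + r"inner-start(.*?)inner-end" + GAP + r"outer-end"
)


def extract(text):
    """Return the inner text, or None when nothing acceptable matches."""
    match = PATTERN.search(text)
    return match.group(1).strip() if match else None


ok = ("outer-start some wanted text inner-start text-that-i-want "
      "inner-end some more wanted text outer-end")
bad = ("outer-start some unwanted text inner-start text-that-i-want "
       "inner-end some more unwanted text outer-end")

print(extract(ok))   # text-that-i-want
print(extract(bad))  # None
```

Note that "wanted" on its own is still accepted; only the exact sequence "unwanted" blocks the gap.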

Roger Pate
That's a better solution than mine.
Ken Fox
+1  A: 

You can't easily do that with plain regexes, but some systems such as Perl have extensions that make it easier. One way is to use a negative look-ahead assertion:

/outer-start(?:u(?!nwanted)|[^u])*?inner-start(.*?)inner-end.*?outer-end/

The key is to split up the "unwanted" into ("u" not followed by "nwanted") or (not "u"). That allows the pattern to advance, but will still find and reject all "unwanted" strings.
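
A rough way to check this is the Python sketch below. It assumes Python's re module behaves like Perl here, which it does for this syntax; STEP, PATTERN, samples and results are illustrative names only. As in the pattern above, only the gap before inner-start is guarded.

```python
import re

# One step through the gap: either a "u" that does not begin "unwanted",
# or any character other than "u".
STEP = r"(?:u(?!nwanted)|[^u])"

# Same shape as the pattern above: only the first gap is guarded.
PATTERN = re.compile(
    r"outer-start" + STEP + r"*?inner-start(.*?)inner-end.*?outer-end"
)

samples = [
    # Contains a "u" ("usual"), but not the word "unwanted".
    "outer-start the usual text inner-start A inner-end tail outer-end",
    # Contains "unwanted" in the guarded gap.
    "outer-start some unwanted text inner-start B inner-end tail outer-end",
]

results = [bool(PATTERN.search(s)) for s in samples]
print(results)  # [True, False]
```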

People may start hating your code if you do much of this though. ;)

Ken Fox
+2  A: 

A better question to ask yourself than "how do I do this with regular expressions?" is "how do I solve this problem?". In other words, don't get hung up on trying to solve a big problem with regular expressions. If you can solve half the problem with regular expressions, do so, then solve the other half with another regular expression or some other technique.

For example, make a pass over your data getting all matches, ignoring the unwanted text (read: get results both with and without the unwanted text). Then make a pass over the reduced set of data and weed out those results that have the unwanted text. This sort of solution is easier to write, easier to understand and easier to maintain over time. And for any problem you are likely to solve with this approach, it will be fast enough.
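
As a rough illustration of the two passes, here is a Python sketch. It assumes the delimiters from the question and one record per line; the function name wanted_matches and its banned parameter are invented for the example.

```python
import re

# Pass 1: the simple pattern, with each gap captured so that it can be
# inspected afterwards.
PATTERN = re.compile(r"outer-start(.*?)inner-start(.*?)inner-end(.*?)outer-end")


def wanted_matches(text, banned="unwanted"):
    """Return inner texts whose surrounding gaps lack the banned word."""
    results = []
    for before, inner, after in PATTERN.findall(text):
        # Pass 2: weed out candidates with the banned word in either gap.
        if banned in before or banned in after:
            continue
        results.append(inner.strip())
    return results


text = (
    "outer-start some wanted text inner-start first inner-end more wanted text outer-end\n"
    "outer-start some unwanted text inner-start second inner-end more text outer-end\n"
)

print(wanted_matches(text))  # ['first']
```

The filter is an ordinary substring test, so changing the rule later (several banned words, case-insensitive checks) does not mean touching the regex.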

Bryan Oakley