I'm using Python and I want to use regular expressions to check if something "is part of an include list" but "is not part of an exclude list".

My include list is represented by a regex, for example:

And.*

Everything which starts with And.

Also the exclude list is represented by a regex, for example:

(?!Andrea)

Everything, but not the string Andrea. The exclude list is obviously a negation.

Using the two examples above, I want to match everything which starts with And except for Andrea.

In the general case I have an includeRegEx and an excludeRegEx. I want to match everything that matches includeRegEx but does not match excludeRegEx. Note: excludeRegEx is still in the negative form (as you can see in the example above), so it would be more accurate to say: if something matches includeRegEx, I check whether it also matches excludeRegEx; if it does, the match succeeds. Is it possible to express this in a single regular expression?

I think Conditional Regular Expressions could be the solution but I'm not really sure of that.

I'd like to see a working example in Python.

Thank you very much.

A:

Why not put both in one regex?

And(?!rea$).*

Since the lookahead only "looks ahead" without consuming any characters, this works just fine (well, this is the whole point of lookaround, actually).

So, in Python:

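import re

subject = "Andrew"  # example input, assumed here for illustration

if re.match(r"And(?!rea$).*", subject):
    # Successful match. Note that re.match always anchors
    # the match to the start of the string.
    print("match")
else:
    # Match attempt failed.
    print("no match")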

From the wording of your question, I'm not sure whether you're starting with two already finished "match/don't match" regexes. If you are, you could simply combine them automatically by concatenating them. This works just as well but is uglier:

(?!Andrea$)And.*

In general, then:

(?!excludeRegex$)includeRegex
Tim Pietzcker
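The general form above can be sketched as a small helper. This is only an illustration: the function name `combine` and the sample inputs are made up here, and it assumes the exclude pattern is given in positive form (`Andrea`, not `(?!Andrea)`), since the helper adds the negative lookahead itself.

```python
import re

def combine(include_regex, exclude_regex):
    """Return one pattern that matches include_regex unless the
    whole string matches exclude_regex (hypothetical helper)."""
    # The (?:...) groups keep an alternation such as "a|b" inside
    # either regex from leaking into the rest of the pattern.
    return r"(?!(?:%s)$)(?:%s)" % (exclude_regex, include_regex)

pattern = re.compile(combine(r"And.*", r"Andrea"))

for subject in ["Andrea", "AndreaXXX", "Andrew", "Bob"]:
    # re.match anchors at the start of the string.
    print(subject, bool(pattern.match(subject)))
```

With these inputs only `AndreaXXX` and `Andrew` are accepted: `Andrea` is rejected by the lookahead and `Bob` never matches the include part.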
The first example you've given is not a general solution to the problem. It works only for my example. In the general case I have *includeRegEx* and *excludeRegEx*. For your second example, I can see that strings like AndreaXXX don't match. I'd like them to match instead (it starts with "And" but it is not "Andrea").
Luca
Well, that's what the second solution is for. I thought that abstracting from that would be trivial. Will edit to clarify.
Tim Pietzcker
It doesn't seem to me that *(?!excludeRegex)includeRegex* is the solution. If it were, *(?!Andrea)And.** should also work, but it doesn't, as I showed in my previous comment.
Luca
OK, added an end-of-string anchor (`$`) to make this happen.
Tim Pietzcker
@Luca, forget about conditionals, this is the solution. For maximum flexibility you can express the *include* conditions as lookaheads too. Then just run them all together: `^(?=And)(?=.{6}$)(?!Andrea$)` -- "starts with `And`, six characters long, not `Andrea`"
Alan Moore
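A rough sketch of the "everything as lookaheads" idea from the comment above, assuming the conditions arrive as two lists of pattern strings. The helper name `build_filter` and the variable `name_filter` are invented for this example.

```python
import re

def build_filter(includes, excludes):
    """Chain include patterns as positive lookaheads and exclude
    patterns as negative lookaheads (hypothetical helper)."""
    positive = "".join("(?=%s)" % p for p in includes)
    negative = "".join("(?!%s)" % p for p in excludes)
    # Lookaheads consume nothing, so every condition is tested
    # from the same position: the start of the string.
    return re.compile("^" + positive + negative)

# Rebuilds the pattern from the comment:
# starts with And, six characters long, not Andrea.
name_filter = build_filter(includes=[r"And", r".{6}$"],
                           excludes=[r"Andrea$"])

for subject in ["Andrew", "Andrea", "Andy", "Bandit"]:
    print(subject, bool(name_filter.match(subject)))
```

Because the pattern consists only of lookaheads, a successful match is empty; it is useful as a yes/no test, not for extracting text.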
Yes. It seems to work as I wanted. Thank you very much!
Luca
Actually, I would use `r'And(?!rea\b).*'` instead. You want the word to end there, not the whole line.
ΤΖΩΤΖΙΟΥ
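To see where the `$` and `\b` variants differ, here is a small comparison. The sample strings and the variable names `end_of_string` and `word_boundary` are chosen only for illustration.

```python
import re

end_of_string = re.compile(r"And(?!rea$).*")   # excludes only the exact string Andrea
word_boundary = re.compile(r"And(?!rea\b).*")  # excludes Andrea as a whole leading word

for subject in ["Andrea", "AndreaXXX", "Andrea Bocelli"]:
    print(subject,
          bool(end_of_string.match(subject)),
          bool(word_boundary.match(subject)))
```

Both variants reject `Andrea` and accept `AndreaXXX`. They disagree on `Andrea Bocelli`: the `$` version accepts it, while the `\b` version rejects it because the word Andrea ends at the space.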