How to construct a regular expression search pattern to find string1 that is not followed by string2 (immediately or not)?

For instance, if string1="MAN" and string2="PN", example search results would be:

"M": Not found
"MA": Not found
"MAN": Found
"BLAH_MAN_BLEH": Found
"MAN_PN": Not found
"BLAH_MAN_BLEH_PN": Not found
"BLAH_MAN_BLEH_PN_MAN": Not found

Ideally, this would be a one-line search, instead of doing a second search for string2.

PS: Language being used is Python

+2  A: 

It looks like you can use MAN(?!.*PN). This matches MAN and uses negative lookahead to make sure that it's not followed by PN (as seen on rubular.com).

Given MAN_PN_MAN_BLEH, the above pattern will find the second MAN, since it's not followed by PN. If you want to validate the entire string and make sure that there's no MAN.*PN, then you can use something like ^(?!.*MAN.*PN).*MAN.*$ (as seen on rubular.com).
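To see how the two patterns differ, here is a minimal sketch that runs both against the examples from the question. The names ANY_MAN, NO_MAN_THEN_PN, samples and results are just illustrative, and the strings are assumed to be single-line.

```python
import re

# First pattern from the answer: some MAN with no PN anywhere after it.
ANY_MAN = re.compile(r"MAN(?!.*PN)")

# Second pattern from the answer: the string contains MAN,
# and no MAN in the string is followed by PN.
NO_MAN_THEN_PN = re.compile(r"^(?!.*MAN.*PN).*MAN.*$")

samples = [
    "M",
    "MA",
    "MAN",
    "BLAH_MAN_BLEH",
    "MAN_PN",
    "BLAH_MAN_BLEH_PN",
    "BLAH_MAN_BLEH_PN_MAN",
]

# Map each sample to (first pattern found?, second pattern found?).
results = {
    s: (bool(ANY_MAN.search(s)), bool(NO_MAN_THEN_PN.search(s)))
    for s in samples
}

for s, (any_man, whole) in results.items():
    print(f"{s!r}: MAN(?!.*PN)={any_man}, whole-string={whole}")
```

The two patterns disagree only on "BLAH_MAN_BLEH_PN_MAN": the first one finds the trailing MAN, while the whole-string pattern rejects it, which is the result the question asks for.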


Non-regex option

If the strings are to be matched literally, then you can also check for indices of substring occurrences.

In Python, find and rfind return lowest and highest index of substring occurrences respectively.

You want to make sure that string1 occurs but is never followed by string2. Since both return -1 when the substring is not found, it looks like you can just test for this condition:

s.rfind(string2) < s.find(string1)

This compares the leftmost occurrence of string1 and the rightmost occurrence of string2.

  • If neither occurs, both are -1, and result is false
  • If string1 occurs, but string2 doesn't, then result is true as expected
  • If both occur, then the rightmost string2 must be to the left of the leftmost string1
    • That is, no string1 is ever followed by string2
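The index comparison above can be sketched as a small function and checked against the examples from the question. The function name occurs_and_never_followed and the expected table are hypothetical, and the sketch uses the str.find and str.rfind methods on the string itself.

```python
def occurs_and_never_followed(s, string1, string2):
    """Return True if string1 occurs in s and no string2 starts at or after
    the leftmost string1.

    Compares the rightmost index of string2 with the leftmost index of
    string1; both methods return -1 when the substring is absent.
    """
    return s.rfind(string2) < s.find(string1)


# Expected results taken from the examples in the question.
expected = {
    "M": False,
    "MA": False,
    "MAN": True,
    "BLAH_MAN_BLEH": True,
    "MAN_PN": False,
    "BLAH_MAN_BLEH_PN": False,
    "BLAH_MAN_BLEH_PN_MAN": False,
}

for s, want in expected.items():
    got = occurs_and_never_followed(s, "MAN", "PN")
    print(f"{s!r}: {'Found' if got else 'Not found'} (expected {want})")
```

Because this only compares two indices, it treats the strings literally and needs no escaping of regex metacharacters.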

polygenelubricants
In the intended example, it would not be a match. So, if string1 occurs, but is ever followed by string2, it would negate the find.
apalopohapa
@apalopohapa: I'm still not sure if I get this 100% correct, so feel free to edit the question later with more examples and/or unaccept my answer if it fails you on some input. I will come back to this several hours from now and see if there are any unresolved issues.
polygenelubricants
@apalopohapa: Also, I think `MAN(?!.*P(?=.*N?))` is just `MAN(?!.*P)`, which then opens up possibilities of `MAN[^P]*` etc. The more information we have, the better answers we can give.
polygenelubricants
@apalopohapa: I think `MAN(?!.*P(?=.*N))` is just `MAN(?!.*P.*N)`
polygenelubricants
-1 Don't use $ to match end of string, use \Z. "." doesn't match newline by default; you need to use the re.DOTALL flag.
John Machin
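The points in the last comment only matter when the input can contain newlines. A minimal sketch of the difference, assuming a hypothetical two-line input (the variable names are illustrative):

```python
import re

# Hypothetical input where PN sits on the line after MAN.
text = "MAN\nPN"

# By default "." does not match a newline, so the lookahead cannot
# reach the PN on the next line and MAN is reported as a match.
default_match = re.search(r"MAN(?!.*PN)", text)

# With re.DOTALL, "." also matches "\n", so the lookahead sees PN
# and the match is rejected.
dotall_match = re.search(r"MAN(?!.*PN)", text, re.DOTALL)

# "$" also matches just before a trailing newline; "\Z" matches only
# at the very end of the string.
dollar_match = re.search(r"MAN$", "MAN\n")
z_match = re.search(r"MAN\Z", "MAN\n")

print(bool(default_match), bool(dotall_match))
print(bool(dollar_match), bool(z_match))
```

If the strings are known to be single-line with no trailing newline, both spellings behave the same.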