Hi, I am searching for a way to model a RegEx which would give a match for both of these strings when searched for "sun shining".
the sun is shining
a shining sun is nice
Thx
You will need to use a regular expression that considers every permutation like this:
\b(sun\b.+\bshining|shining\b.+\bsun)\b
Here the word boundaries \b are used so that only the whole words sun and shining match, and not longer words that merely contain them, such as “sunny”.
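To see the effect of the word boundaries, here is a minimal sketch in Python (my choice of language, the question does not name one). The helper name matches_both is made up for illustration.

```python
import re

# Both word orders spelled out, with \b so that "sunny" does not count as "sun".
PATTERN = re.compile(r"\b(sun\b.+\bshining|shining\b.+\bsun)\b")


def matches_both(text):
    """Return True if text contains 'sun' and 'shining' as whole words, in either order."""
    return PATTERN.search(text) is not None


print(matches_both("the sun is shining"))           # True
print(matches_both("a shining sun is nice"))        # True
print(matches_both("a sunny day, shining bright"))  # False: "sunny" is not "sun"
```

Note that .+ requires at least one character between the two words, which is always the case when they are separated by whitespace.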
Basic regular expressions don't handle differing orders of words very well. There are ways to do it but the regular expressions become ugly and unreadable to all but the regex gurus. I prefer to opt for readability in most cases myself.
My advice would be to use a simple or variant, something like:
sun.+shining|shining.+sun
with word boundaries if necessary:
\bsun\b.+\bshining\b|\bshining\b.+\bsun\b
As Lucero points out, this will become unwieldy as the number of words you're searching for increases, in which case I would go for the multiple regex match solution:
def hasAllWords (string, words[]):
    for each word in words[]:
        # every word must appear as a whole word somewhere in the string
        if not string.match ("\b" + word + "\b"):
            return false
    return true
That pseudo-code will run a check for each word and ensure that all of them appear.
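A runnable version of that pseudo-code might look like the following in Python. This is only a sketch: the name has_all_words is mine, and it assumes the search terms are plain words (letters, digits, underscores), since \b behaves differently next to punctuation.

```python
import re


def has_all_words(text, words):
    """Return True if every word in words occurs in text as a whole word."""
    return all(
        # re.escape guards against regex metacharacters in a search term
        re.search(r"\b" + re.escape(word) + r"\b", text) is not None
        for word in words
    )


print(has_all_words("the sun is shining", ["sun", "shining"]))     # True
print(has_all_words("a shining sun is nice", ["sun", "shining"]))  # True
print(has_all_words("sunny and shining", ["sun", "shining"]))      # False
```

The order of the words does not matter here, and adding a third or fourth word needs no change to the code.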
I'd use positive lookaheads for each word, like this (and you can add as many as you like):
(?=.*?\bsun\b)(?=.*?\bshining\b).*
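If the word list is not fixed, the lookaheads can be generated. Here is a rough sketch in Python; build_lookahead_pattern is a made-up helper name, and it assumes single-line input because . does not match a newline by default.

```python
import re


def build_lookahead_pattern(words):
    """Build one regex with a positive lookahead per word, so order is irrelevant."""
    lookaheads = "".join(r"(?=.*?\b" + re.escape(word) + r"\b)" for word in words)
    return re.compile(lookaheads + ".*")


pattern = build_lookahead_pattern(["sun", "shining"])
print(pattern.pattern)  # (?=.*?\bsun\b)(?=.*?\bshining\b).*

# match() anchors at the start of the string, so each lookahead scans the whole line once.
print(pattern.match("the sun is shining") is not None)     # True
print(pattern.match("a shining sun is nice") is not None)  # True
print(pattern.match("sunny and shining") is not None)      # False
```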
You use two regexes.
if ( ( $line =~ /\bsun\b.+\bshining\b/ ) ||
     ( $line =~ /\bshining\b.+\bsun\b/ ) ) {
    # do whatever
}
Sometimes you have to do what seems to be low-tech. Other answers to this question will have you building complex regexes with alternation and lookahead and whatever, but sometimes the best way is to do it the simplest way, and in this case, it's to use two different regexes.
Don't worry about execution speed. Unless you benchmark this solution against other more complicated single-expression solutions, you don't know which is faster. It's incredibly easy to write slow regexes.
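If you do want numbers, a benchmark could be sketched like this. It is in Python with timeit rather than Perl, the function names and sample lines are invented for illustration, and the timings depend entirely on your machine and your real data, so treat it as a template rather than a result.

```python
import re
import timeit

# The two-regex approach and the single lookahead regex, compiled once.
SUN_FIRST = re.compile(r"\bsun\b.+\bshining\b")
SHINING_FIRST = re.compile(r"\bshining\b.+\bsun\b")
LOOKAHEAD = re.compile(r"(?=.*?\bsun\b)(?=.*?\bshining\b).*")

# Hypothetical sample lines; replace with data that resembles your own.
SAMPLES = ["the sun is shining", "a shining sun is nice", "a sunny day with no match"]


def with_two_regexes(line):
    return bool(SUN_FIRST.search(line) or SHINING_FIRST.search(line))


def with_lookaheads(line):
    return LOOKAHEAD.match(line) is not None


def benchmark(number=10000):
    """Time each approach over all samples; returns seconds keyed by function name."""
    results = {}
    for func in (with_two_regexes, with_lookaheads):
        results[func.__name__] = timeit.timeit(
            lambda: [func(s) for s in SAMPLES], number=number
        )
    return results


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name}: {seconds:.4f}s")
```

Whichever variant you time, check first that both give the same answers on your inputs, otherwise the comparison is meaningless.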