I've been stuck on this one for a while and can't work it out. Given three words word1, word2 and word3, I would like to construct a regex that matches them in that order, optionally with other words between them (but never a newline).

For example, if I had the following:

word1 = what
word2 = the
word3 = hell

I would like to match the following strings, with a single match:

"what the hell"
"what in the hell"
"what the effing hell"
"what in the 9 doors of hell"

I thought I could do the following (allowing for 0 to 5 words to exist between each word variable):

regex = "\bword1(\b\w+\b){0,5}word2(\b\w+\b){0,5}word3\b"

Alas, no, it doesn't work. It's important that I have a way to specify a word distance of m to n between the words (where m is always less than n).
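
For reference, a minimal Python sketch of why the attempt above fails and of one possible way to express an m-to-n word distance. The `(\b\w+\b)` group has no way to consume the whitespace between words, so the pattern can never get past the first space. The helper name `gap_pattern` and its parameters are hypothetical, and `[^\S\n]+` is just one assumed way to say "whitespace, but not a newline".

```python
import re

def gap_pattern(words, min_gap=0, max_gap=5):
    """Build a regex matching `words` in order, with between
    min_gap and max_gap filler words between each adjacent pair."""
    sep = r"[^\S\n]+"  # one or more whitespace characters, excluding newline
    # A filler word is a separator followed by a word, repeated m..n times.
    filler = rf"(?:{sep}\w+){{{min_gap},{max_gap}}}"
    # Join the escaped target words with "filler words, then a separator".
    body = (filler + sep).join(re.escape(w) for w in words)
    return re.compile(rf"\b{body}\b")

pattern = gap_pattern(["what", "the", "hell"], 0, 5)

phrases = [
    "what the hell",
    "what in the hell",
    "what the effing hell",
    "what in the 9 doors of hell",
]
# Each entry pairs a phrase with whether the pattern found a match in it.
results = [(p, pattern.search(p) is not None) for p in phrases]

# The original attempt, with the three words substituted in, for comparison.
original = re.compile(r"\bwhat(\b\w+\b){0,5}the(\b\w+\b){0,5}hell\b")
original_hit = original.search("what the hell")  # None: no room for spaces
```

Calling `gap_pattern(["what", "the", "hell"], 1, 2)` would instead require one or two filler words in each gap, so "what the hell" would no longer match.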

+2  A: 

"\bwhat(\s*\b\w*\b\s*){0,5}the(\s*\b\w*\b\s*){0,5}hell" works for me (in Ruby)

list = ["what the hell", "what in the hell", "what the effing hell", 
  "what in the 9 doors of hell", "no match here hell", "what match here hell"]

list.map{|i| /\bwhat(\s*\b\w*\b\s*){0,5}the(\s*\b\w*\b\s*){0,5}hell/.match(i) }
=> [#<MatchData:0x12c4d1c>, #<MatchData:0x12c4d08>, #<MatchData:0x12c4cf4>,
   #<MatchData:0x12c4ce0>, nil, nil]
Ben Hughes
This matches the whole phrase but returns '' for group(1). I also tried (\s*\w*\s*){0,5} with the same result. This is much further than I got on my own, though! Any suggestions? I'm doing this in Python, in case that matters.
Brandon Watson
+1  A: 
$ cat try
#! /usr/bin/perl

use warnings;
use strict;

my @strings = (
  "what the hell",
  "what in the hell",
  "what the effing hell",
  "what in the 9 doors of hell",
  "hello",
  "what the",
  " what the hell",
  "what the hell ",
);

for (@strings) {
  print "$_: ", /^what(\s+\w+){0,5}\s+the(\s+\w+){0,5}\s+hell$/
                  ? "match\n"
                  : "no match\n";
}

$ ./try
what the hell: match
what in the hell: match
what the effing hell: match
what in the 9 doors of hell: match
hello: no match
what the: no match
 what the hell: no match
what the hell : no match
Greg Bacon
This is the most elegant thus far, and it works as advertised, but there are sub matches. You tell me, do I care about that? I care most that the whole string is matched with word1 at the front, word2 somewhere in the middle and word3 at the end ("somewhere in the middle" is the word distance issue).
Brandon Watson
That's as straightforward as adding anchors to the pattern. Revised!
Greg Bacon
You shouldn't care about the sub matches. You can always just get the whole string that matches. In Python, which you mention below, you do this via matchobj.group(0). If you're opposed to subgroups happening at all, just switch all parens from (\s+\w+) to (?:\s+\w+) so that they don't grab a subgroup.
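
A minimal Python sketch of that suggestion, reusing the anchored pattern from the Perl answer above. The variable names are illustrative only. It shows that group(0) holds the whole match either way, that a repeated capturing group keeps only its last repetition, and that the (?:...) form produces no subgroups at all.

```python
import re

# Anchored pattern from the Perl answer, with capturing groups.
capturing = re.compile(r"^what(\s+\w+){0,5}\s+the(\s+\w+){0,5}\s+hell$")
# Same pattern with the parentheses switched to non-capturing groups.
non_capturing = re.compile(r"^what(?:\s+\w+){0,5}\s+the(?:\s+\w+){0,5}\s+hell$")

text = "what in the 9 doors of hell"

m1 = capturing.match(text)
m2 = non_capturing.match(text)

whole = m2.group(0)       # the entire matched string
leftovers = m1.groups()   # (' in', ' of'): last repetition of each group
no_groups = m2.groups()   # (): nothing is captured
```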
Clint
Awesomeness ensues...
Brandon Watson
A: 

Works for me in Clojure:

(def phrases ["what the hell" "what in the hell" "what the effing hell"
              "what in the 9 doors of hell"])

(def regexp #"\bwhat(\s*\b\w*\b\s*){0,5}the(\s*\b\w*\b\s*){0,5}hell")

(defn valid? []
  (every? identity (map #(re-matches regexp %) phrases)))

(valid?)  ; <-- true

This uses Ben Hughes' pattern.

twopoint718