I need to create a regex that can match multiple strings. For example, I want to find all the instances of "good" or "great". I found some examples, but what I came up with doesn't seem to work:

\b(good|great)\w*\b

Can anyone point me in the right direction?

Edit: I should note that I don't want to just match whole words. For example, I may want to match "ood" or "reat" as well (parts of the words).

Edit 2: Here is some sample text: "This is a really great story." I might want to match "this" or "really", or I might want to match "eall" or "reat".

+1  A: 
(good)*(great)*

after your edit:

\b(g*o*o*d*)*(g*r*e*a*t*)*\b
Chris Ballance
Won't that match ooooooooooooooooooooooooooooooooooooooooddddddddddddddddddddddddddddd?
C. Ross
Yes, along with "gore", "gogogo", and a bunch of other unintended combinations.
Randy
+3  A: 

If you can guarantee that there are no "reserved" characters in your word list, you could just use this code to make a big word list into @"\b(a|big|word|list)\b". There's nothing wrong with the | operator as you're using it, as long as those () surround it.

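// Example word list; assumed to contain no regex metacharacters.
String[] word_list = { "good", "great" };
// String.Join takes the separator first, then the values.
String regex = String.Format(@"\b({0})\w*\b",
    String.Join("|", word_list));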

Note that the \w* won't be captured -- if you're trying to do anything but confirm that one of the words is contained in the string, you'll want to make a larger capturing group around the whole thing. Also bear in mind that a search for "library" will miss the word "libraries." If you're trying to solve that problem, you might want to read up on stemming.
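As a rough illustration of the "larger capturing group" idea, here is a minimal Python sketch (not the C# above). The word list and sample text are made up for the example, and the list is assumed to contain no regex metacharacters.

```python
import re

# Hypothetical word list; assumed to contain no regex metacharacters.
word_list = ["good", "great"]

# The outer group captures the whole word, including the \w* tail;
# the inner group captures only the listed word that started it.
pattern = re.compile(r"\b((" + "|".join(word_list) + r")\w*)\b")

text = "A great story about goodness, not greed."
matches = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]
print(matches)  # [('great', 'great'), ('goodness', 'good')]
```

Here group 1 holds the full matched word and group 2 the listed word it begins with, so "goodness" is reported along with the "good" that triggered it.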

If your list isn't carefully picked for regex (i.e., if it might include any regex characters, like +*?()[]{}), you could convert the strings a little more. So, if your word is gr+eat, you could automatically convert it to [g][r][+][e][a][t] without too much thought, and avoid the accidental regex. Just make sure to convert "[" and "]" to "[\\[]" and "[\\]]", or you'll get weird syntax errors at best, and at worst, no errors and wrong behavior.
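Many regex libraries ship a helper that does this escaping for you. The sketch below uses Python's re.escape instead of the per-character bracket trick described above; the word list and sample text are invented for the example.

```python
import re

# Hypothetical list that includes a regex metacharacter ("+").
word_list = ["gr+eat", "good"]

# re.escape backslash-escapes characters that would otherwise be regex syntax.
escaped = [re.escape(word) for word in word_list]
pattern = re.compile("(" + "|".join(escaped) + ")")

found = pattern.findall("gr+eat and greeat and good")
print(found)  # ['gr+eat', 'good']
```

Without the escaping, "gr+eat" would be read as "g, one or more r, eat" and would match "great" and "grreat" but not the literal "gr+eat".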

ojrac
A: 

I'm not sure I understand the problem correctly:

If you want to match "great" or "reat" you can express this by a pattern like:

"g?reat"

This simply says that the "reat"-part must exist and the "g" is optional.

This would match "reat" and "great" but not "eat", because the first "r" in "reat" is required.

If you have the two words "great" and "good" and want to match both, each with an optional "g", you can write:

(g?reat|g?ood)

And if you want to include a word-boundary like:

\b(g?reat|g?ood)

You should be aware that this would not match anything like "breat": the "reat" is there, but the "b" in front of it means the "r" is not at a word boundary.
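The difference between the two patterns can be checked quickly. This is a small Python sketch; the test words are just the ones mentioned above, and the non-capturing group (?:...) is used only to keep the output simple.

```python
import re

unanchored = re.compile(r"g?reat|g?ood")
anchored = re.compile(r"\b(?:g?reat|g?ood)")

# For each word: (matches without \b, matches with leading \b)
results = {}
for word in ["great", "reat", "eat", "breat"]:
    results[word] = (bool(unanchored.search(word)), bool(anchored.search(word)))
    print(word, results[word])
```

"breat" is the interesting case: the unanchored pattern finds "reat" inside it, while the anchored one finds nothing.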

So if you want to match whole words that contain a substring like "reat" or "ood" then you should try:

"\b\w*?(reat|ood)\w*\b"

This reads: 1. Starting at a word boundary, match any number of word characters, but don't be greedy. 2. Match "reat" or "ood", which ensures that only words containing one of them are matched. 3. Match any number of word characters following "reat" or "ood" until the next word boundary is reached.

This will match:

"goodness", "good", "ood" (if a complete word)

It can be read as: Give me all complete words that contain "ood" or "reat".
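A quick way to try this out, sketched in Python. The sample sentence is made up, the tail is written as \w* so that words ending in the substring (like "good") also match, and the group is non-capturing so that findall returns whole words.

```python
import re

pattern = re.compile(r"\b\w*?(?:reat|ood)\w*\b")

text = "This is a really great story about goodness and food."
words = pattern.findall(text)
print(words)  # ['great', 'goodness', 'food']
```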

Is that what you are looking for?

A: 

I'm not entirely sure that regex alone offers a solution for what you're trying to do. You could, however, use the following code to create a regex for a given word, although the resulting pattern has the potential to become very long and slow:

// Despite the name, this collects every contiguous substring of $word
// that is at least $minLength characters long (longest first per start).
function wordPermutations( $word, $minLength = 2 )
{
    $perms = array( );

    for ($start = 0; $start < strlen( $word ); $start++)
    {
        for ($end = strlen( $word ); $end > $start; $end--)
        {
            $perm = substr( $word, $start, ($end - $start));

            if (strlen( $perm ) >= $minLength)
            {
                $perms[] = $perm;
            }
        }
    }

    return $perms;
}

Test Code:

$perms = wordPermutations( 'great', 3 );  // get all substrings of "great" that are 3 or more chars in length
var_dump( $perms );

echo ( '/\b('.implode( '|', $perms ).')\b/' );

Example Output:

array
  0 => string 'great' (length=5)
  1 => string 'grea' (length=4)
  2 => string 'gre' (length=3)
  3 => string 'reat' (length=4)
  4 => string 'rea' (length=3)
  5 => string 'eat' (length=3)

/\b(great|grea|gre|reat|rea|eat)\b/
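For comparison, roughly the same idea in Python. This is only a sketch: the function name word_substrings is made up, and the \b anchors are left out on the assumption that fragments inside longer words should be found too, as the question's edits suggest.

```python
import re

def word_substrings(word, min_length=2):
    """Return every contiguous substring of word with at least min_length chars."""
    subs = []
    for start in range(len(word)):
        # Longest first, so longer alternatives come earlier in the pattern.
        for end in range(len(word), start, -1):
            if end - start >= min_length:
                subs.append(word[start:end])
    return subs

subs = word_substrings("great", 3)
# No \b anchors here (an assumption), so fragments inside words match as well.
pattern = re.compile("(" + "|".join(re.escape(s) for s in subs) + ")")

print(subs)     # ['great', 'grea', 'gre', 'reat', 'rea', 'eat']
found = pattern.findall("This is a really great story.")
print(found)    # ['rea', 'great']
```

Note that "rea" is found inside "really", which may or may not be what you want; the number of alternatives also grows quadratically with the word length.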
KOGI
A: 

It sounds like you're talking about stemming. That's a very complex task, and totally unsuited to regexes. I suggest you look for a stemming library for C# (like this one, for example).

Alan Moore