Hi

I'm after a regex for C# which will turn this:

"*one*" *two** two and a bit "three four"

into this:

"*one*" "*two**" two and a bit "three four"

I.e. a quoted string should be unchanged whether it contains one or many words.

Any words with asterisks to be wrapped in double quotes.

Any unquoted words with no asterisks to be unchanged.

Nice to haves: If multiple asterisks could be merged into one in the same step, that would be better. Noise words (e.g. and, a, the) which are not part of a quoted string should be dropped.

Thanks for any help / advice.

Julio

+1  A: 

The following regex will do what you're looking for:

\*+            # Match 1 or more *
 (
  \w+          # Capture character string
 )
\*+            # Match 1 or more *

If you use this in conjunction with the replace statement below, every word matched by (\w+) will be wrapped in "* and *", with any run of asterisks collapsed to a single one:

string s = "\"one\" *two** two and a bit \"three four\"";
Regex r = new Regex(@"\*+(\w+)\*+");

var output = r.Replace(s, @"""*$1*""");

Note: This will leave the below string unquoted:

*two two*

If you wish to match those strings as well, use this regex:

\*+([^*]+)\*+
Gavin Miller
You need to use lazy operators so that strings with multiple matches will also work: `*one* *two* three`. So use: `\*+([^*]+?)\*+`
Greg Miller
Do you mean a string like this: `*one* *two*`? If so, replace will work correctly.
Gavin Miller
@Greg Miller - Either something is wrong with the text in your comment, or you didn't catch the fact that "[^*]+" will match everything up to, but not including, the next asterisk without a lazy operator.
John Fisher
This pattern will miss the simple case of '**'
MizardX
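
The points raised in the comments above can be checked with a short sketch. The class name `AsteriskPatternCheck`, the helper `Wrap`, and the sample inputs are assumptions made for illustration; the two patterns are the ones quoted in the answer and the first comment.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsteriskPatternCheck
{
    // Wraps every asterisk-delimited match in double quotes, as in the answer above.
    public static string Wrap(string pattern, string input)
    {
        return Regex.Replace(input, pattern, "\"*$1*\"");
    }

    public static void Main()
    {
        const string greedy = @"\*+([^*]+)\*+";
        const string lazy = @"\*+([^*]+?)\*+";

        // [^*] can never run past an asterisk, so greedy and lazy agree here.
        Console.WriteLine(Wrap(greedy, "*one* *two* three"));
        Console.WriteLine(Wrap(lazy, "*one* *two* three"));

        // A bare "**" has nothing between the asterisks, so it is left alone.
        Console.WriteLine(Wrap(greedy, "a ** b"));

        // An asterisk word that is already quoted gets wrapped a second time.
        Console.WriteLine(Wrap(greedy, "\"*one*\" two"));
    }
}
```

The last case is the one that matters for the original question: a pattern that only looks at asterisks has no way of knowing it is already inside a quoted string.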
+1  A: 

EDIT: updated code.

This solution works for your request, as well as the nice to have items:

string text = @"test the ""one"" and a *two** two and a the bit ""three four"" a";
string result = Regex.Replace(text, @"\*+(.*?)\*+", @"""*$1*""");
string noiseWordsPattern = @"(?<!"")  # match if double quote prefix is absent
 \b      # word boundary to prevent partial word matches
 (and|a|the)    # noise words
 \b      # word boundary
 (?!"")      # match if double quote suffix is absent
 ";

// to use the commented pattern use RegexOptions.IgnorePatternWhitespace
result = Regex.Replace(result, noiseWordsPattern, "", RegexOptions.IgnorePatternWhitespace);

// or use this one line version instead
// result = Regex.Replace(result, @"(?<!"")\b(and|a|the)\b(?!"")", "");

// remove extra spaces resulting from noise words replacement
result = Regex.Replace(result, @"\s+", " ");

Console.WriteLine("Original: {0}", text);
Console.WriteLine("Result: {0}", result);

Output:

Original: test the "one" and a *two** two and a the bit "three four" a
Result: test "one" "*two*" two bit "three four"

The second replacement (noise words) can leave runs of consecutive spaces behind. To remedy this side effect I added the third replacement, which collapses them into a single space.
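
As a rough sketch, the three replacements can be chained in one helper. The class name `NoiseWordCleanup`, the method `Clean`, and the final `Trim()` are assumptions added here, not part of the code above; the patterns are the same.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NoiseWordCleanup
{
    public static string Clean(string text)
    {
        // Step 1: quote asterisk-delimited words, merging repeated asterisks.
        string result = Regex.Replace(text, @"\*+(.*?)\*+", "\"*$1*\"");
        // Step 2: drop noise words that do not touch a double quote.
        result = Regex.Replace(result, @"(?<!"")\b(and|a|the)\b(?!"")", "");
        // Step 3: collapse the whitespace runs left by step 2, then trim the ends.
        return Regex.Replace(result, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        Console.WriteLine(Clean("test the \"one\" and a *two** two and a the bit \"three four\" a"));
        // The lookarounds only inspect the characters next to the word, so a noise
        // word in the middle of a quoted phrase is removed as well.
        Console.WriteLine(Clean("\"three and four\""));
    }
}
```

The second call shows a limitation worth knowing about: `"three and four"` comes back as `"three four"`, because the lookarounds check only for an adjacent quote and not for whether the word sits inside a quoted string.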

Ahmad Mageed
A: 

Something like this. ArgumentReplacer is a callback that is called for each match. The return value is substituted into the returned string.

// Needs: using System; using System.Text.RegularExpressions;
static void Main() {
    string text = "\"one\" *two** and a bit \"three *** four\"";

    string finderRegex = @"
        (""[^""]*"")           # quoted
      | ([^\s""*]*\*[^\s""]*)  # with asterisks
      | ([^\s""]+)             # without asterisks
    ";

    // Main cannot return the result, so print it instead.
    Console.WriteLine(Regex.Replace(text, finderRegex, ArgumentReplacer,
            RegexOptions.IgnorePatternWhitespace));
}

public static String ArgumentReplacer(Match theMatch) {

    // Don't touch quoted arguments, and arguments with no asterisks
    if (theMatch.Groups[2].Value.Length == 0)
        return theMatch.Value;

    // Quote arguments with asterisks, and replace each run of
    // asterisks with a single one. String.Format uses {0}, not %s.
    return String.Format("\"{0}\"",
          Regex.Replace(theMatch.Value, @"\*\*+", "*"));
}

Alternatives to the left in the pattern have priority over those to the right. This is why I only needed to write "[^\s""]+" in the last alternative.

The quotes, on the other hand, are only matched if they occur at the beginning of the argument. They will not be detected if they occur in the middle of the argument, and we must stop before those if they occur.
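
To see the priority of the alternatives in action, here is a small sketch that reports which alternative matched each token. The class name `TokenClassifier`, the method `Classify`, and the sample input are assumptions made for illustration; the pattern is the one from the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenClassifier
{
    const string FinderRegex = @"
        (""[^""]*"")           # 1: quoted
      | ([^\s""*]*\*[^\s""]*)  # 2: with asterisks
      | ([^\s""]+)             # 3: without asterisks
    ";

    // Returns one "kind:text" entry per token, in input order.
    public static List<string> Classify(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(text, FinderRegex,
                 RegexOptions.IgnorePatternWhitespace))
        {
            string kind = m.Groups[1].Success ? "quoted"
                        : m.Groups[2].Success ? "asterisk"
                        : "plain";
            tokens.Add(kind + ":" + m.Value);
        }
        return tokens;
    }

    public static void Main()
    {
        // foo is glued to the quoted string on purpose.
        foreach (string token in Classify("\"one\" *two** bit foo\"three *** four\""))
            Console.WriteLine(token);
    }
}
```

With this input, `foo` is reported as a plain token that stops before the quote, and the quoted string that follows is matched by the first alternative, asterisks included.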

MizardX
A: 

Given that you wish to match pairs of quotes, I don’t think your language is regular, and therefore I don’t think regex is a good solution. As the saying goes:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.”
Now they have two problems.

See "When not to use Regex in C# (or Java, C++ etc)"

Ian Ringrose
http://www.google.com/search?q=site%3Astackoverflow.com+regex+"now+(he+OR+they)+(have+OR+has)+two+problems" ;)
Alan Moore
A: 

I've decided to follow the advice of a couple of responses and go with a parser solution. I've tried the regexes contributed so far and they seem to fail in some cases. That's probably an indication that regexes aren't the appropriate solution to this problem. Thanks for all responses.
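
For anyone landing here later, this is roughly the shape such a parser could take. It is a minimal sketch only: the class name `SearchTermParser`, the method names, and the noise-word list are assumptions based on the examples in the question, and an unterminated quote is simply copied through to the end of the input.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SearchTermParser
{
    // Assumed noise-word list, taken from the examples in the question.
    static readonly HashSet<string> NoiseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "and", "a", "the" };

    public static string Normalize(string input)
    {
        var terms = new List<string>();
        int i = 0;
        while (i < input.Length)
        {
            if (char.IsWhiteSpace(input[i])) { i++; continue; }

            if (input[i] == '"')
            {
                // Quoted string: copy through the closing quote unchanged.
                int close = input.IndexOf('"', i + 1);
                int end = close < 0 ? input.Length : close + 1;
                terms.Add(input.Substring(i, end - i));
                i = end;
                continue;
            }

            // Bare word: runs until whitespace or the start of a quoted string.
            int start = i;
            while (i < input.Length && !char.IsWhiteSpace(input[i]) && input[i] != '"') i++;
            string word = input.Substring(start, i - start);

            if (word.IndexOf('*') >= 0)
                terms.Add("\"" + CollapseAsterisks(word) + "\"");
            else if (!NoiseWords.Contains(word))
                terms.Add(word);
        }
        return string.Join(" ", terms);
    }

    // Replaces each run of consecutive asterisks with a single one.
    static string CollapseAsterisks(string word)
    {
        var sb = new StringBuilder();
        foreach (char c in word)
            if (c != '*' || sb.Length == 0 || sb[sb.Length - 1] != '*')
                sb.Append(c);
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("\"*one*\" *two** two and a bit \"three four\""));
    }
}
```

Because the scanner knows whether it is inside a quoted string, the quoted `"*one*"` is left alone and noise words inside quotes are kept, which are the cases the pure regex attempts tend to get wrong.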

Julio