Suppose I have the string "cats cats cats and dogs dogs dogs."

What regular expression would I use to turn that string into "cats and dogs.", i.e. to remove the duplicates? The expression must only remove duplicates that directly follow one another. For instance:

"cats cats cats and dogs dogs dogs and cats cats and dogs dogs"

Would return:

"cats and dogs and cats and dogs"

+2  A: 

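Replace \b(\w+)\s+\1\b with $1

The \b anchors restrict the match to whole words; without them, the "d d" inside "and dogs" would also match and the text would become "andogs".

Do this in a loop until no more matches are found. A single global replace is not enough: in cats cats cats the first match consumes the first two words, so the third cats has no partner left within the same pass and survives.

\1 in the regex refers to the contents of the first captured group.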
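The loop described above could be sketched like this. It is only an illustration: the class and method names (`DuplicateWords`, `CollapseByLooping`) are made up for this example, and the pattern is anchored with `\b` on both sides so that only whole words are compared (without the anchors, the `d d` inside `and dogs` would match too).

```csharp
using System;
using System.Text.RegularExpressions;

public static class DuplicateWords
{
    // Hypothetical helper: removes one adjacent duplicate per match and
    // repeats the replacement until a pass leaves the text unchanged.
    public static string CollapseByLooping(string input)
    {
        // \b...\b is an assumption of this sketch: it limits matches to whole words.
        var pair = new Regex(@"\b(\w+)\s+\1\b");
        string previous;
        do
        {
            previous = input;
            // Regex.Replace already replaces every non-overlapping match in one pass.
            input = pair.Replace(input, "$1");
        } while (input != previous);
        return input;
    }
}
```

With the second string from the question, the first pass leaves `cats cats and dogs dogs and cats and dogs`, the second pass gives `cats and dogs and cats and dogs`, and the third pass changes nothing, which ends the loop.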

Try:

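using System;
using System.Text.RegularExpressions;
string str = "cats cats cats and dogs dogs dogs and cats cats and dogs dogs";
// $3 re-inserts the whitespace (or nothing, at end of input) that followed the last duplicate
str = Regex.Replace(str, @"(\b\w+\b)\s+(\1(\s+|$))+", "$1$3");
Console.WriteLine(str);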
Amarghosh
I'm using this code: `replacer = Regex.Replace(replacer, @"([\\n]+)[\s+]?\1", string.Empty);` but it doesn't seem to work. It works in Rubular, though: http://www.rubular.com/r/Ey6wrLYXNw
Immanu'el Smith
@Emmanuel Try `str = Regex.Replace(str, @"(\w+)\s+\1", "$1");`
Amarghosh
Why was this down-voted?
Amarghosh
@Emmanuel see my update
Amarghosh
+1  A: 

No doubt a shorter regex is possible, but this one seems to do the trick:

string somestring = "cats cats cats and dogs dogs dogs and cats cats and dogs dogs";
Regex regex = new Regex(@"(\w+)\s(?:\1\s)*(?:\1(\s|$))");
string result = regex.Replace(somestring, "$1$2");

It also takes into account the last "dogs" not ending with a space.

deltreme
This will remove too many spaces: `cats cats cats and dogs dogs dogs and cats cats and dogs dogs` becomes `catsand dogsand catsand dogs`. It also matches too much: `Michael Bolton on CD` becomes `Michael BoltonCD`. Sorry about the Office Space reference.
Tim Pietzcker
Weird, I can't seem to reproduce those errors. Perhaps I should add some more pieces of flair :]
deltreme
Oops, I missed that you are replacing with `$1$2`, so the first problem I thought I saw is not there. But Michael Bolton still has a problem. Perhaps some hypnosis will help (or a word boundary `\b` before the `\w`).
Tim Pietzcker
+5  A: 
resultString = Regex.Replace(subjectString, @"\b(\w+)(?:\s+\1\b)+", "$1");

will do all replacements in one single call.

Explanation:

\b                 # assert that we are at a word boundary
                   # (we only want to match whole words)
(\w+)              # match one word, capture into backreference #1
(?:                # start of non-capturing, repeating group
   \s+             # match one or more whitespace characters
   \1              # match the same word as previously captured
   \b              # as long as we match it completely
)+                 # do this at least once
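To see the annotated pattern at work, here is a small sketch that wraps the call above in a method. The class and method names (`SinglePassDemo`, `Collapse`) are invented for the example; the pattern and replacement string are the ones from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglePassDemo
{
    // Hypothetical wrapper around the single Regex.Replace call shown above.
    public static string Collapse(string subject)
    {
        // One word, followed by one or more whole-word repeats of itself,
        // is replaced by a single copy of that word.
        return Regex.Replace(subject, @"\b(\w+)(?:\s+\1\b)+", "$1");
    }
}
```

Calling `SinglePassDemo.Collapse` on the second string from the question should return `cats and dogs and cats and dogs` in one pass. A string such as `Michael Bolton on CD` comes back unchanged, because the `\b` anchors stop `on` from matching the tail of `Bolton`. Note that the comparison is case-sensitive unless `RegexOptions.IgnoreCase` is passed.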
Tim Pietzcker
Tim, you're a regex guru. Respect! :)
Koen
A: 
Stephan