views:

2094

answers:

10

Using regular expressions in C#, is there any way to find and remove duplicate words or symbols in a string containing a variety of words and symbols?

Ex.

Initial string of words:

"I like the environment. The environment is good."

Desired string:

"I like the environment. is good"

Duplicates removed: "the", "environment", "."

+4  A: 

Well, Jeff has shown me how to use the magic of in-expression backreferences and the global modifier to make this one happen, so my original answer is inoperative. You should all go vote for Jeff's answer. However, for posterity I'll note that there's a tricky little regex engine sensitivity issue in this one, and if you were using Perl-flavored regex, you would need to do this:

\b(\S+)\b(?=.*\b\1\b.*)

instead of Jeff's answer, because C#'s regex will effectively capture \b in \1 but PCRE will not.

chaos
We've all been down this road... "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."
Jon Freeland
But are there any regex engines today that don't support any kind of statefulness these days? This is a pretty straightforward task with backreferences. In fact, I think something like this is used to demonstrate backreferences in the Camel book (Programming Perl).
arnsholt
yeah, well, see my regex below which works
Jeff Atwood
I'm telling you, the match itself has the boundaries. I can screenshot it for you in RegexBuddy with a bunch of test cases if you want!
Jeff Atwood
I care not for this RegexBuddy of which you speak. If it works in C# code, then okay, but it doesn't work in Perl.
chaos
oh ye of little faith. I just ran this against the production db to remove duplicate tags -- using a .NET CLR regex! That said, of *course* there are tons of flavor specific issues. And when I switch RegexBuddy to "Perl" flavor (it's so awesome, it lets you switch engines on the fly), it doesn't match. But the OP asked for c#, no? :)
Jeff Atwood
Yes, he did. I guess RegexBuddy ain't so bad, then! Looks like I've got another edit coming...
chaos
+1  A: 

Have a look at backreferences:
http://msdn.microsoft.com/en-us/library/thwdfzxy(VS.71).aspx

This a regex that will find doubled words. But it will match only one word per match. So you have to use it more than once.

new Regex( @"(.*)\b(\w+)\b(.*)(\2)(.*)", RegexOptions.IgnoreCase );

Of course this is not the best solution (see other answers, which propose not to use a regex at all). But you asked for a regex - here is one. Maybe just the idea helps you ...

tanascius
+1  A: 

Regex is not suited for everything. Something like your problem does fall into that category. I would advise you to use a parser instead.

Tobias Hertkorn
A: 

I don't know howto this, but search in web i found some link can help you find the way.

This link not is c# code. But i say this is only for find the way.

Check this urls

link 1 link 2 link 3

pho3nix
A: 

As others have pointed out, this is doable with backreferences. See http://msdn.microsoft.com/nb-no/library/thwdfzxy(en-us).aspx for the details on how to use backreferences in .Net.

Your particular problem to remove punctuation as well makes it a bit more complicated, but I think code along these lines (whitespace is not significant in that regex) should do the trick:

(\b\w+(?:\s+\w+)*)\s+\1

I've not tested the regex at all, but that should match one or more words separated by whitespace that are repeated. You'll have to add some more logic to allow for puncuation and so on.

arnsholt
doesn't really work...
Jeff Atwood
+9  A: 

As said by others, you need more than a regex to keep track of words:

var words = new HashSet<string>();
string text = "I like the environment. The environment is good.";
text = Regex.Replace(text, "\\w+", m =>
                     words.Add(m.Value.ToUpperInvariant())
                         ? m.Value
                         : String.Empty);
Per Erik Stendahl
ToUpperInvariant is preferred to ToLower, and if you have lambdas, you have HashSet<string> which replaces Dictionary<string,string> where Key==Value. Otherwise, +1.
sixlettervariables
Thanks. Are there any performance gains from using ToUpperInvariant or is just convention?
Per Erik Stendahl
The HashSet constructor takes an optional IEqualityComparer, and its Add method returns a boolean indicating whether or not the item already exists in the set. So you could instantiate your HashSet with "var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase);" and then reduce your delegate to a one-liner: "return words.Add(m.Value) ? m.Value : string.Empty;"
LukeH
MS special cases ToUpperInvariant for speed, granted I can't find where I saw this at the moment.
sixlettervariables
A: 

You won't be able to use regular expressions for this problem, because regex only matches regular languages. The pattern you are trying to match is context-sensitive, and therefore not "regular."

Fortunately, it is easy enough to write a parser. Have a look at Per Erik Stendahl's code.

Matt Bridges
+1  A: 

Regular expressions would be a poor choice of "tools" to solve this problem. Perhaps the following could work:

HashSet<string> corpus = new HashSet<string>();
char[] split = new char[] { ' ', '\t', '\r', '\n', '.', ';', ',', ':', ... };

foreach (string line in inputLines)
{
    string[] parts = line.Split(split, StringSplitOptions.RemoveEmptyEntries);
    foreach (string part in parts)
    {
        corpus.Add(part.ToUpperInvariant());
    }
}

// 'corpus' now contains all of the unique tokens

EDIT: This is me making a big assumption that you're "lexing" for some sort of analysis like searching.

sixlettervariables
+2  A: 

This seems to work for me

(\b\S+\b)(?=.*\1)

Matches like so

apple apple orange  
orange red blue green orange green blue  
pirates ninjas cowboys ninjas pirates  
Jeff Atwood
Does this do a case insensitive match?
Robert
Also it looks like he wants to match the second instance of the word, not the first.
Robert
Also try it on `pirates ninjas cowboys ninjas pirates dallascowboys`.
chaos
well, you could use a backref instead of a forward ref.. `(?<=.*\1.*)(\b\S+\b)`
Jeff Atwood
@chaos -- RegexBuddy says no match. remember, the match itself is boundary, stuff, boundary
Jeff Atwood
That's not what happens when I tried it in actual code. In Perl, at least, `\b` is zero-width and `\1` isn't capturing it. Conceivably something else might happen in C#'s regex engine though.
chaos
So the real problem is that Perl's regex engine is "teh suck". Duly noted. :) But the OP asked for C# code.. it's actually pretty amazing how good the regex engine in .NET is. http://www.regular-expressions.info/refflavors.html
Jeff Atwood
Ooh, cheap shot at Perl regex! As if C#'s would be anywhere without it. I'd take away your +1 for that if you hadn't edupwned me so helpfully in this thread.
chaos
How about "it can can't it"? \b picks up the boundary between letters and apostrophes in Perl-land, but we've already established that rules are different there :)
Cebjyre
A: 

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

See When not to use Regex in C# (or Java, C++ etc)

Of course using a regex to split the string into words may be a useful first step, however String.Split() is clear and it lickly to do everything you need.

Ian Ringrose