ansaurus

Question

Regular expression to find and remove duplicate words

Answer 1

+4 A:

Well, Jeff has shown me how to use the magic of in-expression backreferences and the global modifier to make this one happen, so my original answer is inoperative. You should all go vote for Jeff's answer. However, for posterity I'll note that there's a tricky little regex engine sensitivity issue in this one, and if you were using Perl-flavored regex, you would need to do this:

\b(\S+)\b(?=.*\b\1\b.*)

instead of Jeff's answer, because C#'s regex will effectively capture \b in \1 but PCRE will not.

chaos 2009-06-29 15:00:52

We've all been down this road... "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

Jon Freeland 2009-06-29 15:12:19

But are there any regex engines today that don't support any kind of statefulness these days? This is a pretty straightforward task with backreferences. In fact, I think something like this is used to demonstrate backreferences in the Camel book (Programming Perl).

arnsholt 2009-06-29 15:13:45

yeah, well, see my regex below which works

Jeff Atwood 2009-07-31 13:13:37

I'm telling you, the match itself has the boundaries. I can screenshot it for you in RegexBuddy with a bunch of test cases if you want!

Jeff Atwood 2009-07-31 13:28:13

I care not for this RegexBuddy of which you speak. If it works in C# code, then okay, but it doesn't work in Perl.

chaos 2009-07-31 13:29:33

oh ye of little faith. I just ran this against the production db to remove duplicate tags -- using a .NET CLR regex! That said, of *course* there are tons of flavor specific issues. And when I switch RegexBuddy to "Perl" flavor (it's so awesome, it lets you switch engines on the fly), it doesn't match. But the OP asked for c#, no? :)

Jeff Atwood 2009-07-31 13:34:13

Yes, he did. I guess RegexBuddy ain't so bad, then! Looks like I've got another edit coming...

chaos 2009-07-31 13:36:32

Answer 2

+1 A:

Have a look at backreferences:
http://msdn.microsoft.com/en-us/library/thwdfzxy(VS.71).aspx

This a regex that will find doubled words. But it will match only one word per match. So you have to use it more than once.

new Regex( @"(.*)\b(\w+)\b(.*)(\2)(.*)", RegexOptions.IgnoreCase );

Of course this is not the best solution (see other answers, which propose not to use a regex at all). But you asked for a regex - here is one. Maybe just the idea helps you ...

tanascius 2009-06-29 15:02:51

Answer 3

+1 A:

Regex is not suited for everything. Something like your problem does fall into that category. I would advise you to use a parser instead.

Tobias Hertkorn 2009-06-29 15:05:45

Answer 4

A:

I don't know howto this, but search in web i found some link can help you find the way.

This link not is c# code. But i say this is only for find the way.

Check this urls

link 1 link 2 link 3

pho3nix 2009-06-29 15:07:14

Answer 5

A:

As others have pointed out, this is doable with backreferences. See http://msdn.microsoft.com/nb-no/library/thwdfzxy(en-us).aspx for the details on how to use backreferences in .Net.

Your particular problem to remove punctuation as well makes it a bit more complicated, but I think code along these lines (whitespace is not significant in that regex) should do the trick:

(\b\w+(?:\s+\w+)*)\s+\1

I've not tested the regex at all, but that should match one or more words separated by whitespace that are repeated. You'll have to add some more logic to allow for puncuation and so on.

arnsholt 2009-06-29 15:11:49

doesn't really work...

Jeff Atwood 2009-07-31 12:01:51

Answer 6

+9 A:

As said by others, you need more than a regex to keep track of words:

var words = new HashSet<string>();
string text = "I like the environment. The environment is good.";
text = Regex.Replace(text, "\\w+", m =>
                     words.Add(m.Value.ToUpperInvariant())
                         ? m.Value
                         : String.Empty);

Per Erik Stendahl 2009-06-29 15:12:08

ToUpperInvariant is preferred to ToLower, and if you have lambdas, you have HashSet<string> which replaces Dictionary<string,string> where Key==Value. Otherwise, +1.

sixlettervariables 2009-06-29 15:19:35

Thanks. Are there any performance gains from using ToUpperInvariant or is just convention?

Per Erik Stendahl 2009-06-29 15:43:22

The HashSet constructor takes an optional IEqualityComparer, and its Add method returns a boolean indicating whether or not the item already exists in the set. So you could instantiate your HashSet with "var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase);" and then reduce your delegate to a one-liner: "return words.Add(m.Value) ? m.Value : string.Empty;"

LukeH 2009-06-29 15:57:50

MS special cases ToUpperInvariant for speed, granted I can't find where I saw this at the moment.

sixlettervariables 2009-06-29 16:36:39

Answer 7

A:

You won't be able to use regular expressions for this problem, because regex only matches regular languages. The pattern you are trying to match is context-sensitive, and therefore not "regular."

Fortunately, it is easy enough to write a parser. Have a look at Per Erik Stendahl's code.

Matt Bridges 2009-06-29 15:14:14

Answer 8

+1 A:

Regular expressions would be a poor choice of "tools" to solve this problem. Perhaps the following could work:

HashSet<string> corpus = new HashSet<string>();
char[] split = new char[] { ' ', '\t', '\r', '\n', '.', ';', ',', ':', ... };

foreach (string line in inputLines)
{
    string[] parts = line.Split(split, StringSplitOptions.RemoveEmptyEntries);
    foreach (string part in parts)
    {
        corpus.Add(part.ToUpperInvariant());
    }
}

// 'corpus' now contains all of the unique tokens

EDIT: This is me making a big assumption that you're "lexing" for some sort of analysis like searching.

sixlettervariables 2009-06-29 15:14:44

Answer 9

+2 A:

This seems to work for me

(\b\S+\b)(?=.*\1)

Matches like so

apple apple orange  
orange red blue green orange green blue  
pirates ninjas cowboys ninjas pirates

Jeff Atwood 2009-07-31 11:52:01

Does this do a case insensitive match?

Robert 2009-07-31 13:21:12

Also it looks like he wants to match the second instance of the word, not the first.

Robert 2009-07-31 13:22:30

Also try it on `pirates ninjas cowboys ninjas pirates dallascowboys`.

chaos 2009-07-31 13:24:22

well, you could use a backref instead of a forward ref.. `(?<=.*\1.*)(\b\S+\b)`

Jeff Atwood 2009-07-31 13:24:31

@chaos -- RegexBuddy says no match. remember, the match itself is boundary, stuff, boundary

Jeff Atwood 2009-07-31 13:25:13

That's not what happens when I tried it in actual code. In Perl, at least, `\b` is zero-width and `\1` isn't capturing it. Conceivably something else might happen in C#'s regex engine though.

chaos 2009-07-31 13:28:16

So the real problem is that Perl's regex engine is "teh suck". Duly noted. :) But the OP asked for C# code.. it's actually pretty amazing how good the regex engine in .NET is. http://www.regular-expressions.info/refflavors.html

Jeff Atwood 2009-07-31 13:36:34

Ooh, cheap shot at Perl regex! As if C#'s would be anywhere without it. I'd take away your +1 for that if you hadn't edupwned me so helpfully in this thread.

chaos 2009-07-31 13:41:15

How about "it can can't it"? \b picks up the boundary between letters and apostrophes in Perl-land, but we've already established that rules are different there :)

Cebjyre 2009-07-31 13:46:45

Answer 10

A:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

See When not to use Regex in C# (or Java, C++ etc)

Of course using a regex to split the string into words may be a useful first step, however String.Split() is clear and it lickly to do everything you need.

Ian Ringrose 2009-07-31 12:04:38

ansaurus

tags:

views:

answers:

Regular expression to find and remove duplicate words

related questions