I have a string which has several html comments in it. I need to count the unique matches of an expression.

For example, the string might be:

var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";

I currently use this to get the matches:

var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);

This returns 3 matches. However, I would like to get only 2, since there are only two unique matches.

I know I can probably loop through the resulting MatchCollection and remove the extra Match, but I'm hoping there is a more elegant solution.

Clarification: The sample string is greatly simplified from what is actually being used. There can easily be an X8 or X9, and there are likely dozens of each in the string.

A: 

Extract the comments and store them in an array. Then you can filter out the unique values.

But I don’t know how to implement this in C#.

Gumbo
+4  A: 

I would just use the Enumerable.Distinct method, for example like this:

string subjectString = "<!--X1-->Hi<!--X1-->there<!--X2--><!--X1-->Hi<!--X1-->there<!--X2-->";
var regex = new Regex(@"<!--X\d-->");
var matches = regex.Matches(subjectString);
var uniqueMatches = matches
    .OfType<Match>()
    .Select(m => m.Value)
    .Distinct();

uniqueMatches.ToList().ForEach(Console.WriteLine);

Outputs this:

<!--X1-->  
<!--X2-->


As a pure regular expression, you could maybe use this one:

(<!--X\d-->)(?!.*\1.*)

Seems to work on your test string in RegexBuddy at least =)

// (<!--X\d-->)(?!.*\1.*)
// 
// Options: dot matches newline
// 
// Match the regular expression below and capture its match into backreference number 1 «(<!--X\d-->)»
//    Match the characters “<!--X” literally «<!--X»
//    Match a single digit 0..9 «\d»
//    Match the characters “-->” literally «-->»
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*\1.*)»
//    Match any single character «.*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//    Match the same text as most recently matched by capturing group number 1 «\1»
//    Match any single character «.*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
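
Note that the explanation above lists the option "dot matches newline", which in .NET corresponds to RegexOptions.Singleline. Without it, `.` stops at a line break, so the lookahead only checks the rest of the current line and duplicates on later lines are not seen. A minimal sketch of that effect, assuming a multi-line input (the sample string and the LastOccurrences class name are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

static class LastOccurrences
{
    const string Pattern = @"(<!--X\d-->)(?!.*\1.*)";

    // Counts matches of the lookahead pattern with the given options.
    public static int Count(string input, RegexOptions options)
    {
        return Regex.Matches(input, Pattern, options).Count;
    }

    static void Main()
    {
        // Hypothetical multi-line input: X1 appears on both lines.
        string input = "<!--X1-->Hi\n<!--X1-->there<!--X2-->";

        Console.WriteLine(Count(input, RegexOptions.None));       // prints 3
        Console.WriteLine(Count(input, RegexOptions.Singleline)); // prints 2
    }
}
```

With RegexOptions.Singleline the pattern keeps only the last occurrence of each comment. Whether this explains a particular mismatch depends on the actual input, which is not shown here.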
Svish
I liked this idea, but unfortunately the results aren't what I expected. In my unit test (which has a much larger string) I got 8 results when I should have received 4. Not sure what the difference is between RegexBuddy and what I'm using. :(
Sailing Judo
Also, I tried using Distinct() but the MatchCollection, even though it derives from IEnumerable, doesn't seem to recognize this.
Sailing Judo
What is your much larger string? On the MatchCollection you most likely have to use var stuff = theMatchCollection.OfType<Match>().Select(m => m.Value).Distinct(), or something like that.
Svish
Couldn't possibly paste it here... it generally makes a 4k HTML file. I'm looking into Distinct more. Getting closer... the current version looks similar to what you typed above. :) LINQ and lambdas are still a little new to me.
Sailing Judo
Brilliant! Great answer... it would have taken me 30 minutes to figure out that revised example on my own.
Sailing Judo
paste it somewhere else then :) I want to see why my regex fails :p
Svish
Great info. I wasn't aware of OfType(). If I could I would give you +3 :)
Brian Rasmussen
@Brian, you can try to find two other users and talk them into upvoting me as well :p
Svish
A: 

Capture the inner portion of the comment as a group. Then put those strings into a hashtable (dictionary). Then ask the dictionary for its count; since it stores each key only once, repeats are weeded out automatically.

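var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
var tokens = new Dictionary<string, string>();
// Lazy (.*?) so each comment is matched separately; a greedy (.*) would
// span from the first "<!--" to the last "-->" as a single match.
Regex.Replace(teststring, @"<!--(.*?)-->",
    match => {
        // Same string as key and value; the dictionary keeps each key once.
        tokens[match.Groups[1].Value] = match.Groups[1].Value;
        return "";
    });
var uniques = tokens.Keys.Count;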

By using the Regex.Replace construct, you get a lambda called on each match. Since you are not interested in the replaced string, you simply discard the return value.

You must use Groups[1] because Groups[0] is the entire match. I'm storing the same string as both key and value only because that makes it easy to put into the dictionary, which stores each key only once.
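
The same idea can also be sketched without the replace callback, by adding each captured group to a HashSet<string>, which likewise stores each value only once. This is an alternative under the same assumptions rather than the approach described above; the CommentTokens name is illustrative, and the lazy `(.*?)` is used so that each comment is matched on its own:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CommentTokens
{
    // Returns the distinct inner texts of all HTML comments in the input.
    public static HashSet<string> Unique(string input)
    {
        var tokens = new HashSet<string>();
        foreach (Match match in Regex.Matches(input, @"<!--(.*?)-->"))
        {
            // Add ignores values that are already present.
            tokens.Add(match.Groups[1].Value);
        }
        return tokens;
    }

    static void Main()
    {
        var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
        Console.WriteLine(Unique(teststring).Count); // prints 2
    }
}
```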

DevelopingChris
A: 

Depending on how many Xn's you have you might be able to use:

(\<!--X1--\>){1}.*(\<!--X2--\>){1}

That will only match each occurrence of the X1, X2 etc. once provided they are in order.

sipwiz
+1  A: 

It appears you're doing two different things:

  1. Matching comments like /<!--X.-->/
  2. Finding the set of unique comments

So it is fairly logical to handle these as two different steps:

var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);

var uniqueMatches = matches.Cast<Match>().Distinct(new MatchComparer());

class MatchComparer : IEqualityComparer<Match>
{
    public bool Equals(Match a, Match b)
    {
        return a.Value == b.Value;
    }

    public int GetHashCode(Match match)
    {
        return match.Value.GetHashCode();
    }
}
sixlettervariables
Have you tested this? For some reason I cannot get Distinct() to work with the MatchCollection even though this is the second answer that included it. I'm using .NET 3.5 and have System.Linq in my using statements.
Sailing Judo
Fixed the code such that it works.
sixlettervariables
you should use OfType and not Cast
Svish
I want to make sure it throws an exception if anything but a Match shows up. OfType will go ahead and ignore things which may not be Match, which could hide underlying problems.
sixlettervariables
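
The difference between Cast and OfType discussed in these comments can be seen on a small mixed collection. A minimal sketch, using strings instead of Match objects to keep it short (a MatchCollection only ever contains Match objects, so both calls give the same result there); the class and method names are illustrative:

```csharp
using System;
using System.Collections;
using System.Linq;

static class CastVersusOfType
{
    // Counts the strings in a sequence, silently skipping other types.
    public static int CountWithOfType(IEnumerable items)
    {
        return items.OfType<string>().Count();
    }

    // Returns true if Cast<string>() throws because of a non-string element.
    public static bool CastThrows(IEnumerable items)
    {
        try
        {
            items.Cast<string>().ToList();
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }

    static void Main()
    {
        // Hypothetical mixed collection: two strings and one int.
        var items = new ArrayList { "<!--X1-->", 42, "<!--X2-->" };

        Console.WriteLine(CountWithOfType(items)); // prints 2
        Console.WriteLine(CastThrows(items));      // prints True
    }
}
```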