I am fairly new to C# programming and I am stuck on my little ASP.NET project.

My website currently examines Twitter statuses for URLs and then adds those URLs to an array, all via regular expression pattern matching. Clearly more than one person will post an update with a specific URL, so I do not want to list duplicates, and I want to count the number of times a particular URL is mentioned in, say, 100 tweets.

Now I have a List<String> which I can sort so that all duplicate URLs are next to each other. I was under the impression that I could compare list[i] with list[i+1]: if they match, increment a counter (count++); if they don't match, add the URL and the count value to a new array, on the assumption that this is the end of that run of duplicates.

This would remove duplicates and give me a count of the number of occurrences for each URL. At the moment, what I have is not working, and I do not know why (like I say, I am not very experienced with it all).

With the code below, assume that a JSON feed has been searched for using a keyword into srchResponse.results. The results with URLs in them get added to sList, a string List type, which contains only the URLs, not the message as a whole.

I want to put one of each URL (no duplicates), a count integer (converted to a string) for the number of occurrences of that URL, and the username, message, and user image URL all into my jagged array called 'urls[100][]'. I have made the array 100 rows long to make sure everything can fit, but generally this is too big. Each 'row' will have 5 elements in it.

The debugger gets stuck on the line: if (sList[i] == sList[i + 1]) which is the crux of my idea, so clearly the logic is not working. Any suggestions or anything will be seriously appreciated!

Here is sample code:

    var sList = new ArrayList();

    string[][] urls = new string[100][];

    int ctr = 0;
    int j = 1;

    foreach (Result res in srchResponse.results)
    {
        string content = res.text;
        string pattern = @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)";
        MatchCollection matches = Regex.Matches(content, pattern);

        foreach (Match match in matches)
        {
            GroupCollection groups = match.Groups;

            sList.Add(groups[0].Value.ToString());
        }
    }

    sList.Sort();
    foreach (Result res in srchResponse.results)
    {
        for (int i = 0; i < 100; i++)
        {
            if (sList[i] == sList[i + 1])
            {
                j++;
            }
            else
            {
                urls[ctr][0] = sList[i].ToString();
                urls[ctr][1] = j.ToString();
                urls[ctr][2] = res.text;
                urls[ctr][3] = res.from_user;
                urls[ctr][4] = res.profile_image_url;
                ctr++;
                j = 1;
            }
        }
    }

The code then goes on to add each result into a StringBuilder method with the HTML.


+1  A: 

I'd recommend using a more sophisticated data structure than an array. A Set will guarantee that you have no duplicates.

It looks like the C# collections don't include a Set, but there are third-party implementations available, like this one.

duffymo
@duffymo - there is a set structure and it's called `HashSet<>` and it was introduced in .NET 3.5. They couldn't call it Set because that could conflict with the `Set` keyword in Visual Basic.
John Rasch
Thank you, John. Couldn't a namespace have sorted that issue out? I thought that was what it was for. Is there a TreeSet or any other implementation as well?
duffymo
@duffymo - In VB `Set` is a keyword (just like `for` or `if`) so a namespace change wouldn't help. I believe at this point there are no other implementations since there is no interface for a set, but the structure itself isn't sealed so there could be more. Looks as though in .NET 4.0 they're creating an `ISet<>` interface which will allow for other implementations though.
John Rasch
Thanks for the instruction, John.
duffymo
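As a minimal sketch of the `HashSet<>` suggestion from the comments above: the class and method names here (`UniqueUrls`, `Collect`) are made up for illustration, and the input is assumed to be the list of URL strings already extracted from the tweets.

```csharp
using System;
using System.Collections.Generic;

public static class UniqueUrls
{
    // Hypothetical helper: returns each URL once, however often it appears.
    public static HashSet<string> Collect(IEnumerable<string> urls)
    {
        var unique = new HashSet<string>();
        foreach (string url in urls)
        {
            // Add returns false (and changes nothing) when the URL is already present.
            unique.Add(url);
        }
        return unique;
    }
}
```

Note that a set on its own only removes duplicates. It does not keep the number of occurrences, so it covers just half of what the question asks for.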
+6  A: 

The description of your algorithm seems fine. I don't know what's wrong with the implementation; I haven't read it that carefully. (The fact that you are using an ArrayList is an immediate red flag; why aren't you using a more strongly typed generic collection?)

However, I have a suggestion. This is exactly the sort of problem that LINQ was intended to solve. Instead of writing all that error-prone code yourself, just describe the transformation you're interested in, and let the compiler work it out for you.

Suppose you have a list of strings and you wish to determine the number of occurrences of each:

var notes = new []{ "Do", "Fa", "La", "So", "Mi", "Do", "Re" };

var counts = from note in notes 
             group note by note into g
             select new { Note = g.Key, Count = g.Count() };

foreach(var count in counts)
    Console.WriteLine("Note {0} occurs {1} times.", count.Note, count.Count);

Which I hope you agree is much easier to read than all that array logic you wrote. And of course, now you have your sequence of unique items; you have a sequence of counts, and each count contains a unique Note.
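The same grouping can be pointed at the URL list from the question. The following is only a sketch using the method-call form of the query; `UrlCounter` and `CountUrls` are names invented here, and the input is assumed to be a sequence of URL strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UrlCounter
{
    // Hypothetical helper: pairs each distinct URL with its number of
    // occurrences, most frequently mentioned first.
    public static List<KeyValuePair<string, int>> CountUrls(IEnumerable<string> urls)
    {
        return urls
            .GroupBy(url => url)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(pair => pair.Value)
            .ToList();
    }
}
```

No sorting of the input is needed beforehand, since the grouping does not depend on duplicates being adjacent.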

Eric Lippert
Very interesting and helpful... I realise now that LINQ is my way out of this problem. What would be a more strongly typed generic collection? Do you mean a collection I define myself (a collection of objects that uses internal properties for comparison)? Again, thanks Eric, nice stuff.
AlexW
Looks like I missed the notification of other answers being posted. I'll delete mine in lieu of this response...
John Rasch
@AlexW - In this case `List<string>` would be the strongly typed generic collection you need because the list of URLs will explicitly contain strings and only strings (there's obviously more to generics than this, but this is a simple explanation). `ArrayList.Add()` takes `object` types, which could be anything that inherits from `object` (which happens to be everything in .NET). Good tutorial: http://www.c-sharpcorner.com/UploadFile/jgodel/Page102062006170216PM/Page1.aspx
John Rasch
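One concrete consequence of the `ArrayList` versus `List<string>` choice, which may be relevant to the comparison in the question: elements read from an `ArrayList` are typed as `object`, and `==` between two `object` operands compares references rather than string contents. The sketch below illustrates this; `EqualityDemo` and its method names are invented for the example.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class EqualityDemo
{
    // Compares two elements the way the question's code does.
    // ArrayList hands back object, so == is a reference comparison.
    public static bool SameViaArrayList(string first, string second)
    {
        var list = new ArrayList { first, second };
        return list[0] == list[1];
    }

    // With List<string> the elements are typed as string,
    // so == compares the characters.
    public static bool SameViaGenericList(string first, string second)
    {
        var list = new List<string> { first, second };
        return list[0] == list[1];
    }
}
```

Given two separately created strings with the same text, for example `new string('x', 3)` twice, the first method returns false and the second returns true. Calling `Equals` on the `ArrayList` elements would also compare by value.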
A: 

Your loop fails because when i == 99, (i + 1) == 100 which is outside the bounds of your array.

But as others have pointed out, .NET 3.5 has ways of doing what you want more elegantly.

Matt Ellen
I did try using lower values actually, e.g. i < 40 and it still didn't work!
AlexW
What happens then is that it reports an error on the line urls[ctr][1] = j.ToString(); namely a NullReferenceException. I am still confused! I'll try using LINQ and rewriting the whole thing.
AlexW
Maybe if I override the ToString method, this will work for my integer j = 1;
AlexW
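A likely cause of the NullReferenceException mentioned in the comments above: `new string[100][]` creates 100 row slots that are all null until each row is allocated, so writing to `urls[ctr][...]` fails regardless of `ToString`. The sketch below combines that with the bounds check from the answer; it is an assumption-laden rewrite rather than the original code, `RunCounter` and `CountRuns` are invented names, and it keeps only the URL and count columns.

```csharp
using System;
using System.Collections.Generic;

public static class RunCounter
{
    // Hypothetical helper. Expects a list that is already sorted;
    // returns one row per distinct URL: { url, count }.
    public static string[][] CountRuns(List<string> sorted)
    {
        var rows = new List<string[]>();
        int count = 1;
        for (int i = 0; i < sorted.Count; i++)
        {
            // Only look at i + 1 when it is still inside the list.
            bool hasNext = i + 1 < sorted.Count;
            if (hasNext && sorted[i] == sorted[i + 1])
            {
                count++;
            }
            else
            {
                // The row is allocated here, so it is never null.
                rows.Add(new[] { sorted[i], count.ToString() });
                count = 1;
            }
        }
        return rows.ToArray();
    }
}
```

Using the list's own `Count` instead of a fixed 100 also avoids running past the end when fewer URLs were found.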
A: 

If you don't need to know how many duplicates a specific entry has, you could do the following:

LINQ Extension Methods

.Count()   
.Distinct()  
.Count()  
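One possible reading of the method list above, sketched under the assumption that the goal is the total number of repeated entries (overall count minus distinct count). `DuplicateStats` and `CountDuplicates` are names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DuplicateStats
{
    // Hypothetical helper: number of entries that repeat an earlier entry.
    public static int CountDuplicates(IEnumerable<string> urls)
    {
        List<string> all = urls.ToList();
        return all.Count - all.Distinct().Count();
    }
}
```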
citronas