I took inspiration from the example shown at csharp-online and am trying to retrieve all the URLs from this Alexa page:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Text.RegularExpressions;
namespace ExtractingUrls
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient client = new WebClient();
            const string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
            string source = client.DownloadString(url);
            //Console.WriteLine(Getvals(source));
            string matchPattern =
                    @"<a.rel=""nofollow"".style=""font-size:0.8em;"".href=[""'](?<url>[^""^']+[.]*)[""'].class=""offsite"".*>(?<name>[^<]+[.]*)</a>";
            foreach (Hashtable grouping in ExtractGroupings(source, matchPattern, true))
            {
                foreach (DictionaryEntry DE in grouping)
                {
                    Console.WriteLine("Value = " + DE.Value);
                    Console.WriteLine("");
                }
            }
            // End.
            Console.ReadLine();
        }
        public static ArrayList ExtractGroupings(string source, string matchPattern, bool wantInitialMatch)
        {
            ArrayList keyedMatches = new ArrayList();
            int startingElement = 1;
            if (wantInitialMatch)
            {
                startingElement = 0;
            }
            Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
            MatchCollection theMatches = RE.Matches(source);
            foreach (Match m in theMatches)
            {
                Hashtable groupings = new Hashtable();
                for (int counter = startingElement; counter < m.Groups.Count; counter++)
                {
                    // If we had just returned the MatchCollection directly, the
                    // GroupNameFromNumber method would not be available to use
                    groupings.Add(RE.GroupNameFromNumber(counter),
                    m.Groups[counter]);
                }
                keyedMatches.Add(groupings);
            }
            return (keyedMatches);
        }
    }
}

But here I face a problem: when I run it, each URL is displayed three times. First the whole anchor tag is displayed, then the URL is displayed twice. Can anyone suggest what I should correct so that each URL is displayed exactly once?

+3  A: 

Use HTML Agility Pack to parse HTML. I think it will make your problem much easier to solve.

Here's one way to do it:

WebClient client = new WebClient();
string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
string source = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(source);
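// SelectNodes returns null when nothing matches, so guard before iterating.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href and @rel='nofollow']");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}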
Mark Byers
+1  A: 

In your regex, you have two named groups plus the entire match. If I'm reading it correctly, you only want the URL portion of each match, which is the second of the three groups (index 1).

Instead of this:

for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
    // If we had just returned the MatchCollection directly, the
    // GroupNameFromNumber method would not be available to use
    groupings.Add(RE.GroupNameFromNumber(counter),
        m.Groups[counter]);
}

Don't you want this?

groupings.Add(RE.GroupNameFromNumber(1),m.Groups[1]);
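
If only the URL is needed, the Hashtable can be skipped and the named group read directly. A minimal sketch of that idea follows. The class name UrlExtractor, the sample markup, and the simplified pattern are assumptions made up for illustration; they are not the question's pattern or the real page source.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UrlExtractor
{
    // Returns only the text captured by the named group "url" for each match.
    public static List<string> ExtractUrls(string source, string matchPattern)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(source, matchPattern))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    static void Main()
    {
        // Hypothetical sample markup, not the real page source.
        const string sample =
            "<a href=\"http://example.com/a\">A</a> <a href='http://example.com/b'>B</a>";
        // Simplified illustrative pattern, not the one from the question.
        const string pattern =
            @"<a\s+href=[""'](?<url>[^""']+)[""'][^>]*>(?<name>[^<]+)</a>";

        foreach (string url in ExtractUrls(sample, pattern))
        {
            Console.WriteLine(url);
        }
    }
}
```

Looking the group up by name avoids depending on the group's position, so the pattern can gain or lose other groups without changing the calling code.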
Mike Sherov
+1  A: 
int startingElement = 1;
if (wantInitialMatch)
{
    startingElement = 0;
}

...

for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
    // If we had just returned the MatchCollection directly, the
    // GroupNameFromNumber method would not be available to use
    groupings.Add(RE.GroupNameFromNumber(counter),
        m.Groups[counter]);
}

You're passing wantInitialMatch = true, so your for loop is returning:

m.Groups[0] // entire match
m.Groups[1] // (?<url>[^""^']+[.]*) href part
m.Groups[2] // (?<name>[^<]+[.]*) link text
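
To see the three groups side by side, here is a small sketch that prints each group's index, name, and value for a single match. The class name GroupIndexDemo, the sample anchor, and the simplified pattern are hypothetical. The sample deliberately uses the URL as the link text, which is presumably what the real page does and would explain why the URL appears twice.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupIndexDemo
{
    // Describes each group of the first match as "index name = value".
    // When wantInitialMatch is false, group 0 (the entire match) is skipped.
    public static string[] DescribeGroups(string source, string pattern, bool wantInitialMatch)
    {
        var re = new Regex(pattern);
        Match m = re.Match(source);
        int start = wantInitialMatch ? 0 : 1;
        var lines = new string[Math.Max(0, m.Groups.Count - start)];
        for (int i = start; i < m.Groups.Count; i++)
        {
            lines[i - start] = i + " " + re.GroupNameFromNumber(i) + " = " + m.Groups[i].Value;
        }
        return lines;
    }

    static void Main()
    {
        // Hypothetical anchor whose link text equals its href.
        const string sample = "<a href=\"http://example.com/\">http://example.com/</a>";
        // Simplified illustrative pattern, not the one from the question.
        const string pattern =
            @"<a\s+href=[""'](?<url>[^""']+)[""'][^>]*>(?<name>[^<]+)</a>";

        foreach (string line in DescribeGroups(sample, pattern, true))
        {
            Console.WriteLine(line);
        }
    }
}
```

Passing false for wantInitialMatch drops the whole-anchor entry but still leaves both url and name, so printing the URL exactly once also means reading only the url group.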
Paul Creasey
Thank you Paul, now I understand where I went wrong.
Chaitanya