ansaurus

Question

Extracting URLs using regex in .NET

Answer 1

A:

Take a look at this: http://bouncetadiss.blogspot.com/2008/02/parsing-uri-url-in-c-and-vbnet.html

serhio 2010-01-31 23:40:04

Answer 2

+3 A:

Use HTML Agility Pack to parse HTML. I think it will make your problem much easier to solve.

Here's one way to do it:

// Requires: using System; using System.Net; using HtmlAgilityPack;
WebClient client = new WebClient();
string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
string source = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(source);
// SelectNodes returns null (not an empty collection) when nothing matches
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href and @rel='nofollow']");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}

Mark Byers 2010-01-31 23:43:19

Answer 3

+1 A:

Your regex has two capture groups plus the entire match. If I'm reading it correctly, you only want the URL portion of each match, which is the second of the three groups (index 1).

Instead of this:

for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
    // If we had just returned the MatchCollection directly, the
    // GroupNameFromNumber method would not be available to use
    groupings.Add(RE.GroupNameFromNumber(counter),
        m.Groups[counter]);
}

don't you want this?

groupings.Add(RE.GroupNameFromNumber(1),m.Groups[1]);
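
If only the URL is needed, another option is to skip the group numbers and read the named group directly. Here is a minimal sketch of that idea. The pattern is a hypothetical stand-in (the question's full regex is not quoted in this thread), and UrlExtractor and ExtractUrls are illustrative names, not part of the original code:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Hypothetical pattern: an anchor tag with a quoted href and plain link text.
    // It only mirrors the named groups "url" and "name" mentioned in the thread.
    private static readonly Regex LinkRegex = new Regex(
        "<a\\s+href=[\"'](?<url>[^\"']+)[\"'][^>]*>(?<name>[^<]+)</a>",
        RegexOptions.IgnoreCase);

    // Collects the "url" group from every match and ignores the other groups.
    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in LinkRegex.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Looking the group up by name keeps working even if more groups are added to the pattern later, which is not true of a hard-coded index.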

Mike Sherov 2010-01-31 23:48:50

Answer 4

+1 A:

int startingElement = 1;
if (wantInitialMatch)
{
    startingElement = 0;
}

...

for (int counter = startingElement; counter < m.Groups.Count; counter++)
{
    // If we had just returned the MatchCollection directly, the
    // GroupNameFromNumber method would not be available to use
    groupings.Add(RE.GroupNameFromNumber(counter),
        m.Groups[counter]);
}

You're passing wantInitialMatch = true, so your for loop starts at index 0 and returns:

m.Groups[0] //entire match
m.Groups[1] //(?<url>[^""^']+[.]*) href part
m.Groups[2] //(?<name>[^<]+[.]*) link text
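
To see that numbering for yourself, here is a small sketch that lists every group of the first match together with the name GroupNameFromNumber reports for it. The pattern is again a hypothetical stand-in for the one in the question, and GroupDemo and DescribeGroups are made-up names for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Hypothetical pattern with the same two named groups, "url" and "name".
    private static readonly Regex LinkRegex = new Regex(
        "<a\\s+href=[\"'](?<url>[^\"']+)[\"'][^>]*>(?<name>[^<]+)</a>",
        RegexOptions.IgnoreCase);

    // Returns one "groupName=value" entry per group of the first match.
    // Index 0 is always the entire match; the named groups follow.
    public static List<string> DescribeGroups(string html)
    {
        var result = new List<string>();
        Match m = LinkRegex.Match(html);
        if (!m.Success)
        {
            return result;
        }
        for (int i = 0; i < m.Groups.Count; i++)
        {
            result.Add(LinkRegex.GroupNameFromNumber(i) + "=" + m.Groups[i].Value);
        }
        return result;
    }
}
```

For an input such as `<a href="http://example.com/">Example</a>` this gives three entries: the whole tag under the name "0", then "url", then "name". Starting the loop at 1 instead of 0 drops the first entry.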

Paul Creasey 2010-01-31 23:50:55

Thank you Paul, now I understand where I went wrong.

Chaitanya 2010-01-31 23:54:23
