I am trying to capture URLs that are repeated throughout an HTML page. My regex usually works when the URLs are on different lines, but in this case some of them appear together on one line and others on separate lines. The URL has the tags:

Here is what I have been trying:

Dim regex As Regex = New Regex( _
    ".*<a.*href='http://(?<Link>.*?)/profile'>", _
    RegexOptions.IgnoreCase _
    Or RegexOptions.CultureInvariant _
    Or RegexOptions.IgnorePatternWhitespace _
    Or RegexOptions.Compiled _
    )


Dim ms As MatchCollection = regex.Matches(_html)
Dim url As String = String.Empty
For Each m As Match In ms
    ' Host part of the link, captured by the named group "Link"
    url = m.Groups("Link").Value.ToLower
Next

Any ideas appreciated.

+2  A: 

There is no need to use Regex to try to parse HTML when there is the fantastic library called HTML Agility Pack. This library makes light work of finding the links and it will correctly handle special cases where your regular expression will fail. You will get a more robust solution with less effort involved.

This example code demonstrating use of the library is written in C#, but hopefully it will help you to build a solution in VB.NET:

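using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("input.html");
foreach (var link in doc.DocumentNode.Descendants("a"))
{
    // Anchors without an href yield "" here instead of throwing a NullReferenceException.
    string href = link.GetAttributeValue("href", "");
    Match match = Regex.Match(href, "^http://(?<Link>.*?)/profile$");
    if (match.Success)
    {
        Console.WriteLine(match.Groups["Link"].Value);
    }
}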
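
Since the question is in VB.NET, here is a rough VB.NET sketch of the same idea. It assumes the HtmlAgilityPack package is referenced in the project. The module name ProfileLinks, the helper ExtractProfileHosts, and the sample markup are made up for illustration and are not part of the library or the original answer. It parses from a string with LoadHtml rather than from a file so that it runs on its own.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions
Imports HtmlAgilityPack

Module ProfileLinks
    ' Hypothetical helper: returns the lower-cased host part of every
    ' http://<host>/profile link found in the given markup.
    Function ExtractProfileHosts(html As String) As List(Of String)
        Dim hosts As New List(Of String)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        For Each link As HtmlNode In doc.DocumentNode.Descendants("a")
            ' GetAttributeValue returns the fallback ("") when the anchor has no href.
            Dim href As String = link.GetAttributeValue("href", "")
            Dim m As Match = Regex.Match(href, "^http://(?<Link>.*?)/profile$", RegexOptions.IgnoreCase)
            If m.Success Then
                hosts.Add(m.Groups("Link").Value.ToLowerInvariant())
            End If
        Next
        Return hosts
    End Function

    Sub Main()
        ' Made-up sample: two links on one line, which is the case the question struggles with.
        Dim sample As String = "<p><a href='http://alice.example.com/profile'>A</a> <a href='http://bob.example.com/profile'>B</a></p>"
        For Each host As String In ExtractProfileHosts(sample)
            Console.WriteLine(host)
        Next
    End Sub
End Module
```

Because the parser hands over one anchor at a time, it no longer matters whether the links share a line or sit on separate lines.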
Mark Byers
Thank you very much for your response. I will look into applying this in my future programs instead of regex.
vbNewbie
+1  A: 

You may need to add RegexOptions.Singleline. From the docs:

Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).
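
One caveat worth checking before relying on this option alone: the pattern in the question starts with a greedy .* and has another greedy .* after <a, so letting the dot cross newlines can make a single match swallow everything up to the last link. The sketch below compares the question's pattern with and without Singleline against a tightened pattern. The sample input, the module name SinglelineDemo, the helper CaptureLinks, and the tightened pattern are assumptions made for illustration, not something taken from the question or the docs.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SinglelineDemo
    ' Hypothetical helper: collects the "Link" group of every match.
    Function CaptureLinks(html As String, pattern As String, options As RegexOptions) As List(Of String)
        Dim found As New List(Of String)
        For Each m As Match In Regex.Matches(html, pattern, options)
            found.Add(m.Groups("Link").Value)
        Next
        Return found
    End Function

    Sub Main()
        ' Made-up sample: two links on the first line, one on the second.
        Dim html As String = "<a href='http://a.example/profile'><a href='http://b.example/profile'>" &
                             vbLf & "<a href='http://c.example/profile'>"

        ' Pattern from the question (greedy .* before and after <a).
        Dim greedy As String = ".*<a.*href='http://(?<Link>.*?)/profile'>"
        ' Tightened variant: no leading .*, and negated character classes instead of dots.
        Dim tight As String = "<a[^>]*href='http://(?<Link>[^']*)/profile'>"

        ' Question's pattern, no Singleline: only the last link on each line.
        Console.WriteLine(String.Join(", ", CaptureLinks(html, greedy, RegexOptions.IgnoreCase)))
        ' Question's pattern with Singleline: only the last link in the whole input.
        Console.WriteLine(String.Join(", ", CaptureLinks(html, greedy, RegexOptions.IgnoreCase Or RegexOptions.Singleline)))
        ' Tightened pattern: every link, regardless of line breaks.
        Console.WriteLine(String.Join(", ", CaptureLinks(html, tight, RegexOptions.IgnoreCase)))
    End Sub
End Module
```

If the tightened pattern behaves as sketched, it does not need Singleline at all, because a negated character class such as [^>] already matches newlines.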

Adam Ruth