I am using the WebClient class to download HTML data from a web page. Now I want to extract all the href links and their titles from the HTML data. Initially I used loops; finding that inefficient, I switched to regular expressions, but still didn't get an efficient solution.

Here is the initial code:

for (int i = 0; i < htmldata.Length - 5; i++)
{
  // Look for every occurrence of href= in the raw text
  if (htmldata.Substring(i, 5) == "href=")
  {
    // Take everything up to the next double quote as the link
    n1 = htmldata.Substring(i + 6, htmldata.Length - (i + 6)).IndexOf("\"");
    Sublink = htmldata.Substring(i + 6, n1);
    var absoluteUri = new Uri(baseUri, Sublink);
    // Take the text up to the next "<" as the title
    n2 = htmldata.Substring(i + n1 + 1, htmldata.Length - (i + n1 + 1)).IndexOf("<");
    subtitle = htmldata.Substring(i + 6 + n1 + 2, n2 - 7);
  }
}

This code returns some of the links like this:

/l.href.replace(new RegExp(

/advanced_search?hl=en&q=&hl=en&

and titles like this:

onclick=gbar.qs(this) class=gb2>Photos

")+"q="+encodeURIComponent(b)})}i.qs=n;function o(a,b,d,c,f,e){var g=document.getElementById(a);if(g){var 

These are clearly invalid. Please suggest correct code for getting valid relative href links and their titles.

+1  A: 

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

wRAR
He is not trying to parse with regular expressions. He is using substrings and indexes.
Oded
+1  A: 

Use the HTML Agility Pack to parse the HTML for you; then you can use XPath expressions to select all the links in the page and their associated data.

Trying to parse HTML by yourself is error-prone and brittle, as you have already discovered.
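
As a rough sketch of what that could look like (not a definitive implementation): the class and method names LinkExtractor and ExtractLinks are made up for illustration, htmldata and baseUri are taken from the question, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be installed

// Hypothetical helper, named for illustration only.
static class LinkExtractor
{
    // Returns (absolute URI, link text) pairs for every <a href="..."> element.
    public static List<KeyValuePair<Uri, string>> ExtractLinks(string htmldata, Uri baseUri)
    {
        var results = new List<KeyValuePair<Uri, string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(htmldata);

        // Select only anchor elements that carry an href attribute.
        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return results;

        foreach (HtmlNode a in anchors)
        {
            // Depending on the library version, entities such as &amp;
            // in the href may still need decoding.
            string href = a.GetAttributeValue("href", "");

            // The visible link text, with HTML entities decoded.
            string title = HtmlEntity.DeEntitize(a.InnerText).Trim();

            // Resolve relative links against the page address;
            // skip values that cannot be turned into a URI.
            Uri absolute;
            if (Uri.TryCreate(baseUri, href, out absolute))
                results.Add(new KeyValuePair<Uri, string>(absolute, title));
        }

        return results;
    }
}
```

Because the XPath expression matches only a elements, occurrences of href= inside script text, like the ones in the question's output, should not be picked up. Links such as javascript: URIs will still come through and may need filtering by scheme.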

Oded