Regex hrefs = new Regex("<a href.*?>");
Regex http = new Regex("http:.*?>");  
StringBuilder sb = new StringBuilder();
WebClient client = new WebClient();
string source = client.DownloadString("http://google.com");
foreach (Match m in hrefs.Matches(source)){
    sb.Append(http.Match(m.ToString()));
    Console.WriteLine(http.Match(m.ToString()));
}

The code works fine, but there is one problem. Look at the output:

http://images.google.se/imghp?hl=sv&amp;tab=wi" onclick=gbar.qs(this) class=gb1>
http://video.google.se/?hl=sv&amp;tab=wv" onclick=gbar.qs(this) class=gb1>
http://maps.google.se/maps?hl=sv&amp;tab=wl" onclick=gbar.qs(this) class=gb1>
http://news.google.se/nwshp?hl=sv&amp;tab=wn" onclick=gbar.qs(this) class=gb1>
http://translate.google.se/?hl=sv&amp;tab=wT" onclick=gbar.qs(this) class=gb1>
http://mail.google.com/mail/?hl=sv&amp;tab=wm" class=gb1>
http://www.google.se/intl/sv/options/" onclick="this.blur();gbar.tg(event);return !1" aria-haspopup=true class=gb3>
http://blogsearch.google.se/?hl=sv&amp;tab=wb" onclick=gbar.qs(this) class=gb2>
http://www.youtube.com/?hl=sv&amp;tab=w1&amp;gl=SE" onclick=gbar.qs(this) class=gb2>
http://www.google.com/calendar/render?hl=sv&amp;tab=wc" class=gb2>
http://picasaweb.google.se/home?hl=sv&amp;tab=wq" onclick=gbar.qs(this) class=gb2>
http://docs.google.com/?hl=sv&amp;tab=wo" class=gb2>
http://www.google.se/reader/view/?hl=sv&amp;tab=wy" class=gb2>
http://sites.google.com/?hl=sv&amp;tab=w3" class=gb2>
http://groups.google.se/grphp?hl=sv&amp;tab=wg" onclick=gbar.qs(this) class=gb2>
http://www.google.se/ig%3Fhl%3Dsv%26source%3Diglk&amp;usg=AFQjCNEsLWK4azJkUc3KrW46JTUSjK4vhA" class=gb4>
http://www.google.se/" class=gb4>
http://www.google.com/intl/sv/landing/games10/index.html">
http://www.google.com/ncr">

How can I remove the HTML tags?

A: 

Truncate the matched text at its first double quote. Instead of using http.Match(m.ToString()) directly, do:

string url = http.Match(m.ToString()).Value;
url = url.Remove(url.IndexOf("\""));

Not the cleanest way to do it, but it works.

RC1140
A: 

Change the closing character in the http regex from > to ", so the match ends where the URL does.

JamesB
+10  A: 

Change your regex to:

Regex http = new Regex("http:.*?\"");

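Or even better, parse all links using HtmlAgilityPack and XPath:

var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");

// Select every <a> element that has an href attribute.
// SelectNodes returns null when nothing matches, so guard before looping.
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");

if (nodes != null)
{
    foreach (var node in nodes)
    {
        // Print the link target rather than the markup inside the link.
        Console.WriteLine(node.GetAttributeValue("href", ""));
    }
}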
alexn
+1 for suggesting a non-regex solution. Parsing real-world HTML can be tricky and I consider regex a 90% solution. If I need something more bulletproof than that I use a tool designed for the task at hand.
Seth Petry-Johnson
Parsing HTML with regex is bad. Unless you need the performance, go for something that can do the job properly.
spender
+1 for HtmlAgilityPack
šljaker
+2  A: 

The quick solution is to change this:

Regex http = new Regex("http:.*?>");

To this:

Regex http = new Regex("http:.*?\"");

The better solution is to use a library to parse the HTML. The HTML Agility Pack can be used for that and will make your life easier.

Oded
A: 
Regex hrefs = new Regex("<a href.*?>");
Regex http = new Regex("(http:.*?)\"");  
StringBuilder sb = new StringBuilder();
WebClient client = new WebClient();
string source = client.DownloadString("http://google.com");
foreach (Match m in hrefs.Matches(source))
{
    var value = http.Match(m.ToString()).Groups[1].Value;
    sb.Append(value);
    Console.WriteLine(value);
}
Darin Dimitrov
A: 

A nice and simple solution: match any run of characters after http: that are not a double quote.

"http:[^\"]*"
Fadrian Sudaman
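For anyone who wants to try the negated character class from the last answer without hitting the network, here is a minimal sketch. The LinkExtractor class and ExtractUrls helper are hypothetical names made up for illustration, and the sample string is just one line from the output in the question with the start of the tag added back. It assumes the URLs are wrapped in double quotes and start with http:, so https: links or single-quoted attributes would need a different pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // "http:" followed by any run of characters that are not a double quote.
    private static readonly Regex Http = new Regex("http:[^\"]*");

    // Hypothetical helper (not from the original post): returns every match in the input.
    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Http.Matches(html))
        {
            urls.Add(m.Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Sample anchor tag modelled on the output shown in the question.
        string sample = "<a href=\"http://maps.google.se/maps?hl=sv&amp;tab=wl\" onclick=gbar.qs(this) class=gb1>";

        foreach (string url in ExtractUrls(sample))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because the pattern stops at the quote instead of consuming it, there is no trailing " to trim and no capture group is needed. Note that the result still contains &amp;, since that is what the page source contains; if you need a usable URL, passing it through System.Net.WebUtility.HtmlDecode should take care of that.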