ansaurus

Question

How to fix my crawler in C#?

Answer 1

A:

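Trim the matched text at the first double quote. Instead of using http.Match(m.ToString()) directly, take its value and cut it there:

string url = http.Match(m.ToString()).Value;
url = url.Remove(url.IndexOf('"'));

Not the cleanest way to do it, but it works as long as the match contains a double quote.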

RC1140 2010-02-26 13:25:16

Answer 2

A:

Change the closing delimiter in your regex from > to " so that the match stops at the end of the URL.

JamesB 2010-02-26 13:26:15

Answer 3

+10 A:

Change your regex to:

Regex http = new Regex("http:.*?\"");

Or even better, parse all links using HtmlAgilityPack and XPath:

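using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

var web = new HtmlWeb();
var doc = web.Load("http://www.stackoverflow.com");

// Select every <a> element that has an href attribute.
// SelectNodes returns null when nothing matches.
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");

if (nodes != null)
{
    foreach (var node in nodes)
    {
        // Print the link target rather than the anchor's inner markup.
        Console.WriteLine(node.GetAttributeValue("href", ""));
    }
}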

alexn 2010-02-26 13:29:16

+1 for suggesting a non-regex solution. Parsing real-world HTML can be tricky and I consider regex a 90% solution. If I need something more bulletproof than that I use a tool designed for the task at hand.

Seth Petry-Johnson 2010-02-26 13:33:39

Parsing HTML with regex is bad. Unless you need the performance, go for something that can do the job properly.

spender 2010-02-26 13:37:45

+1 for HtmlAgilityPack

šljaker 2010-02-26 15:17:58

Answer 4

+2 A:

The quick solution is to change this:

Regex http = new Regex("http:.*?>");

To this:

Regex http = new Regex("http:.*?\"");

The better solution is to use a library to parse the HTML. The HTML Agility Pack can be used for that and will make your life easier.
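To see why the closing character matters, here is a minimal sketch that runs both patterns against one anchor tag. The sample tag and the names LinkRegexDemo, SampleTag, and FirstMatch are made up for illustration and are not from the original question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkRegexDemo
{
    // Hypothetical sample anchor tag; not taken from the original question.
    public const string SampleTag = "<a href=\"http://example.com/page\" class=\"nav\">";

    // Returns the text matched by the pattern inside the tag, or "" if nothing matches.
    public static string FirstMatch(string pattern, string tag)
    {
        return Regex.Match(tag, pattern).Value;
    }

    public static void Main()
    {
        // Lazy match up to '>' runs past the URL into the remaining attributes.
        Console.WriteLine(FirstMatch("http:.*?>", SampleTag));

        // Lazy match up to '"' stops at the closing quote of the href value.
        Console.WriteLine(FirstMatch("http:.*?\"", SampleTag));
    }
}
```

Note that the second pattern still includes the trailing quote in the match, so it has to be trimmed or excluded with a capture group.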

Oded 2010-02-26 13:29:58

Answer 5

A:

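using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

Regex hrefs = new Regex("<a href.*?>");   // each opening anchor tag
Regex http = new Regex("(http:.*?)\"");   // group 1: the URL without the closing quote
StringBuilder sb = new StringBuilder();
WebClient client = new WebClient();
string source = client.DownloadString("http://google.com");
foreach (Match m in hrefs.Matches(source))
{
    // Groups[1].Value is empty when the tag contains no http: URL.
    var value = http.Match(m.ToString()).Groups[1].Value;
    sb.Append(value);
    Console.WriteLine(value);
}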

Darin Dimitrov 2010-02-26 13:31:03

Answer 6

A:

A nice and simple solution: match every character after http: up to, but not including, the next " character.

"http:[^\"]*"

Fadrian Sudaman 2010-02-26 13:55:49
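The negated character class from the answer above can be sketched as follows. This is only an illustration under assumptions: the class name UrlExtractor, the method Extract, and the sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // "http:" followed by any run of characters that are not a double quote.
    private static readonly Regex Http = new Regex("http:[^\"]*");

    // Returns every http: URL found in the given HTML fragment, in document order.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Http.Matches(html))
        {
            urls.Add(m.Value);
        }
        return urls;
    }

    public static void Main()
    {
        // Hypothetical sample markup.
        string html = "<a href=\"http://example.com/a\">A</a> <a href=\"http://example.com/b\" id=\"x\">B</a>";
        foreach (string url in Extract(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because the character class stops before the quote, no trimming is needed. The pattern only matches URLs that start with http: and sit inside double quotes, so https links and single-quoted attributes would be missed.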
