views:

661

answers:

1

I'm currently using the HTML Agility Pack in C# for a web crawler. I've managed to avoid many issues so far (Invalid URIs, such as "/extra/url/to/base.html" and "#" links), but I also need to process PHP, Javascript, etc. Like for some sites, the links are in PHP, and when my web crawler tries to navigate to these, it fails. One example is a PHP/Javascript accordion link page. How would I go about navigating/parsing these links?

+4  A: 

Lets see if I understood your question correctly. I'm aware that this answer is probably inadequate but if you need a more specific answer I'd need more details.


You're trying to program a web crawler but it cannot crawl URLs that end with .php?

If that's the case you need to take a step back and think about why that is. It could be because the crawler chooses which URLs to crawl using a regex based on an URI scheme.

In most cases these URLs are just normal HTML but they could also be a generated image (like a captcha) or a download link for a 700mb iso file - and there's no way to know be certain without checking out the header of the HTTP response from that URL.

Note: If you're writing your own crawler from scratch you're going to need good understanding of HTTP.

The first thing your crawler is going to see when gets an URL is the header, which contains a MIME content-type - it tells a browser/crawler how to process and open the data (is it HTML, normal text, .exe, etc). You'll probably want to download pages based on the MIME type instead of an URL scheme. The MIME type for HTML is text/html and you should check for that using the HTTP library you're using before downloading the rest of the content of an URL.


The Javascript problem

Same as above except that running javascript in the crawler/parser is pretty uncommon for simple projects and might create more problems than it solves. Why do you need Javascript?


A different solution
If you're willing to learn Python (or already know it) I suggest you look at Scrapy. It's a web crawling framework built similarly to the Django web framework. It's really easy to use and a lot of problems have already been solved so it could be a good starting point if you're trying to learn more about the technology.

Hannson
Thanks! Just what I needed. Sometimes I don't even know what questions to ask!
cam
Glad I could help!
Hannson