tags:
views: 325
answers: 2

Is there a way to make a web robot like websiteoutlook.com does? I need something that searches the internet for URLs only...I don't need links, descriptions, etc.

What is the best way to do this without getting too technical? I guess it could even be a cron job that runs a PHP script grabbing URLs from Google, or is there a better way?

A simple example or a link to more information would be much appreciated.

A: 

I've just had a quick look at the site you mentioned - it appears to fetch info for one domain at a time, rather than crawl for URLs.

Anyway, you would write a script which takes a URL from a queue, fetches the page contents, parses out the URLs within, and adds those to the queue. Then add a starting URL to the queue and run the script from cron.
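
A rough sketch of that loop in PHP might look like the following. The flat-file queue, the regex-based link extraction, and the cron schedule are just illustrative choices, not the only way to do it:

    <?php
    // crawl.php - minimal queue-based URL collector (illustrative sketch).
    // Run from cron, e.g.: */5 * * * * php /path/to/crawl.php

    $queueFile = 'queue.txt';   // one URL per line, acts as the work queue
    $seenFile  = 'seen.txt';    // URLs already fetched, to avoid re-crawling

    $queue = file_exists($queueFile) ? file($queueFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) : [];
    $seen  = file_exists($seenFile)  ? file($seenFile,  FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) : [];

    if (empty($queue)) {
        exit("Queue is empty - add a starting URL to queue.txt\n");
    }

    // Take one URL off the front of the queue.
    $url = array_shift($queue);

    if (!in_array($url, $seen)) {
        $html = @file_get_contents($url);   // fetch the page contents
        if ($html !== false) {
            // Parse out href attributes; a real crawler would also resolve
            // relative URLs and respect robots.txt, both omitted here.
            if (preg_match_all('/href=["\'](https?:\/\/[^"\']+)["\']/i', $html, $matches)) {
                foreach (array_unique($matches[1]) as $found) {
                    if (!in_array($found, $seen) && !in_array($found, $queue)) {
                        $queue[] = $found;   // add newly found URLs to the queue
                    }
                }
            }
        }
        $seen[] = $url;
    }

    // Persist the queue and seen list for the next cron run.
    file_put_contents($queueFile, implode("\n", $queue) . "\n");
    file_put_contents($seenFile,  implode("\n", $seen) . "\n");

Each cron run processes one URL and grows the queue, so the list of seen URLs in seen.txt becomes your output over time.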

adam
Well, I think the site does use a robot because of their page here: http://websiteoutlook.com/remove_url.html
Chris
A: 

Around 4 million unique URLs can be found at DMOZ.org. Crawling the catalogue is allowed at a rate of no more than one page per second. As a crawler you can use site-downloading software like HTTrack (it has an option to comply with robots.txt rules). All you have to do then is parse the downloaded pages for URLs (and properly attribute the site afterwards).
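
Once HTTrack has produced a local mirror, the parsing step could be a small PHP script like the sketch below. The mirror directory name, the output file, and the regex are assumptions for illustration:

    <?php
    // extract_urls.php - walk a local HTTrack mirror and collect absolute URLs.

    $mirrorDir = __DIR__ . '/dmoz_mirror';   // directory produced by HTTrack (assumed path)
    $urls = [];

    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($mirrorDir, FilesystemIterator::SKIP_DOTS)
    );

    foreach ($iterator as $file) {
        if (strtolower($file->getExtension()) !== 'html') {
            continue;   // only parse the saved HTML pages
        }
        $html = file_get_contents($file->getPathname());
        // Grab absolute http/https URLs from href attributes.
        if (preg_match_all('/href=["\'](https?:\/\/[^"\']+)["\']/i', $html, $matches)) {
            foreach ($matches[1] as $u) {
                $urls[$u] = true;   // use array keys for cheap de-duplication
            }
        }
    }

    file_put_contents('urls.txt', implode("\n", array_keys($urls)) . "\n");
    echo count($urls) . " unique URLs written to urls.txt\n";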

Webmezha