hello,
I would like to generate a list of URLs for a domain but I would rather save bandwidth by not crawling the domain myself. So is there a way to use existing crawled data?
One solution I thought of would be to do a Yahoo site search, which lets me download the first 1000 results in TSV format. However to get all the records I would have to scrape the search results. Google also supports site search but doesn't offer an easy way to download the data.
Can you think of a better way that would work with most (if not all) websites?
thanks, Richard