views:

242

answers:

3

Hello,

I would like to generate a list of URLs for a domain, but I would rather save bandwidth by not crawling the domain myself. Is there a way to use existing crawled data instead?

One solution I thought of would be to do a Yahoo site search, which lets me download the first 1000 results in TSV format. However, to get all the records I would have to scrape the search results pages. Google also supports site search but doesn't offer an easy way to download the data.
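For illustration, a rough sketch of what scraping a site: query would look like is below. The search endpoint, query parameters, and result markup are placeholders I made up; any real engine uses different ones and may disallow scraping in its terms of service, so treat this only as a sketch of the idea.

    import time
    import requests
    from bs4 import BeautifulSoup

    # Hypothetical sketch: paginate a site: query against a search engine
    # and collect result URLs. The endpoint and parameters below are
    # placeholders, not a real search API.
    SEARCH_URL = "https://search.example.com/search"  # placeholder endpoint

    def site_urls(domain, pages=10, per_page=100):
        urls = set()
        for page in range(pages):
            resp = requests.get(
                SEARCH_URL,
                params={"q": f"site:{domain}", "start": page * per_page},
                timeout=10,
            )
            soup = BeautifulSoup(resp.text, "html.parser")
            # Assumes result links point at the target domain; a real
            # results page would need a more specific selector.
            for a in soup.find_all("a", href=True):
                if domain in a["href"]:
                    urls.add(a["href"])
            time.sleep(1)  # be polite between requests
        return urls

    if __name__ == "__main__":
        for url in sorted(site_urls("foo.org")):
            print(url)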

Can you think of a better way that would work with most (if not all) websites?

Thanks, Richard

+1  A: 

Some webmasters offer Sitemaps, which are essentially XML lists of every URL on the domain. However, there is no general solution except crawling. If you do use a crawler, please obey robots.txt.
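When a sitemap does exist, something along these lines would pull the URL list without a full crawl. This is only a minimal sketch: it assumes the conventional /sitemap.xml location, uses foo.org as a stand-in domain, and does not handle sitemap index files that point to further sitemaps.

    import urllib.robotparser
    import urllib.request
    import xml.etree.ElementTree as ET

    DOMAIN = "http://foo.org"  # placeholder domain from the discussion

    # Check robots.txt before fetching anything.
    rp = urllib.robotparser.RobotFileParser(DOMAIN + "/robots.txt")
    rp.read()

    sitemap_url = DOMAIN + "/sitemap.xml"  # assumed conventional location
    if rp.can_fetch("*", sitemap_url):
        with urllib.request.urlopen(sitemap_url) as resp:
            tree = ET.parse(resp)
        # Sitemap entries live in the sitemaps.org namespace.
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        for loc in tree.findall(".//sm:loc", ns):
            print(loc.text.strip())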

Matthew Flaschen
Unfortunately, most sites I've looked at don't use them. I'm hoping to make use of the results of another crawler instead of crawling again myself.
Plumo
I have to disagree: there is at least one general solution, which I mentioned above, namely using the crawled results from a search engine. This is done with a site:foo.org query.
Plumo
Richard, search engines do not index every domain, and their listings do not include every page on the domains they do index. That's why site:foo.org is not a general solution.
Matthew Flaschen
A: 

You can download a list of up to 500 URLs for free through this online tool:

XML Sitemap Generator

Just select "text list" after the tool crawls your site.

A: 

It seems there is no royal road to web crawling, so I will just stick with my current approach...

Also, I found that most search engines only expose the first 1000 results anyway.

Plumo