How to get list of URLs for a domain

Hello,

I would like to generate a list of URLs for a domain but I would rather save bandwidth by not crawling the domain myself. So is there a way to use existing crawled data?

One solution I thought of would be to do a Yahoo site search, which lets me download the first 1000 results in TSV format. However, to get all the records I would have to scrape the search results. Google also supports site search but doesn't offer an easy way to download the data.

Can you think of a better way that would work with most (if not all) websites?

Thanks, Richard

+1 A:

Some webmasters offer Sitemaps, which are essentially XML lists of every URL on the domain. However, there is no general solution except crawling. If you do use a crawler, please obey robots.txt.
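For sites that do publish a Sitemap, something like the following might be a starting point. It is only a rough sketch using the Python standard library: the function names are made up for illustration, it assumes the robots.txt and sitemap contents have already been downloaded as strings, and it assumes the sitemap follows the sitemaps.org 0.9 schema. It also includes a robots.txt check for the case where you fall back to crawling.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by the sitemaps.org 0.9 schema (assumed here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_in_robots(robots_txt):
    """Return sitemap URLs declared on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons survive.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def urls_in_sitemap(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document.

    For a sitemap index file, the values returned are the URLs of the
    nested sitemaps rather than of pages.
    """
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def may_fetch(robots_txt, url, user_agent="*"):
    """Report whether robots.txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In use, you would fetch http://foo.org/robots.txt, pass its text to sitemaps_in_robots, fetch each sitemap it lists, and pass that text to urls_in_sitemap. Not every site lists its sitemap in robots.txt, so an empty result does not prove that no sitemap exists.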

Matthew Flaschen 2009-06-28 05:25:38

Unfortunately, most sites I've looked at don't use them. I'm hoping to make use of the results of another crawler instead of crawling again myself.

Plumo 2009-06-28 09:22:26

I have to disagree: there is at least one general solution, which, as I explained, is to use the crawled results from a search engine. This is done using site:foo.org.

Plumo 2009-06-28 09:28:21

Richard, search engines do not index every domain, and their listings do not include every page on the domains they do index. That's why site:foo.org is not a general solution.

Matthew Flaschen 2009-06-28 09:37:20

You can download a list of up to 500 URLs free through this online tool:

XML Sitemap Generator

Just select "text list" after the tool crawls your site.

2009-08-23 04:29:55

It seems there is no royal road to web crawling, so I will just stick to my current approach.

Also, I found that most search engines only expose the first 1000 results anyway.

Plumo 2009-10-05 02:59:27
