I would like to get a list of all the URLs a site links to (within the same domain) without actually downloading all of the content, ideally with something like wget. Is there a way to tell wget to just list the links it WOULD download?
For a little background on what I'm using this for, in case someone can come up with a better solution: I'm trying to build a robots.txt file that excludes all files ending in p[4-9].html, but robots.txt doesn't support regular expressions. So my plan is to get all the links, run a regular expression against them, and then put the results into robots.txt. Any ideas?
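To make the plan concrete, here's a rough sketch of what I have in mind (example.com stands in for my site, and I'm not sure the wget flags or log-parsing step are right; that's essentially what I'm asking):

```sh
# Crawl the site without saving pages, and scrape the visited URLs out of
# wget's log. The grep/awk step assumes the usual "--<timestamp>--  <URL>"
# log lines; the field number may differ between wget versions.
wget --spider --recursive --no-parent http://example.com/ 2>&1 \
  | grep '^--' \
  | awk '{ print $3 }' \
  | sort -u > all-links.txt

# Keep only the URLs I want to block and turn them into robots.txt rules,
# e.g. "Disallow: /some/path/p4.html".
grep -E 'p[4-9]\.html$' all-links.txt \
  | sed 's|^http://example\.com|Disallow: |' >> robots.txt
```

But if wget can simply list the links without fetching everything, or there's a cleaner tool for this, I'd rather do that.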