I would like to use something like wget to get all the URLs a site links to (on the same domain) without downloading all of the content. Is there a way to tell wget to just list the links it WOULD download?

For a little background on what I'm using this for, in case someone can come up with a better solution: I'm trying to build a robots.txt file that excludes all files ending in p[4-9].html, but robots.txt doesn't support regular expressions. So my plan is to get all the links, run a regular expression against them, and then put the results into robots.txt. Any ideas?
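
For example, what I'm hoping to end up with is a robots.txt containing one plain Disallow line per matching page, something like this (the paths here are just made up to show the shape):

    User-agent: *
    Disallow: /articles/foo-p4.html
    Disallow: /articles/foo-p5.html
    Disallow: /articles/bar-p7.html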

A: 

My recommendation: combine wget and gawk in a (very) small shell script.
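
Something along these lines should work as a starting point (example.com, the crawl depth, and the exact wget log format are assumptions -- different wget versions format their log slightly differently, so run the wget line on its own first and check its output):

    #!/bin/sh
    # Crawl the site in --spider mode so nothing is kept on disk, pull the URLs
    # out of wget's log, and turn every page ending in p4.html..p9.html into a
    # robots.txt Disallow rule.
    site="http://example.com/"

    wget --spider --force-html -r -l 2 "$site" 2>&1 \
      | gawk '
          # In wget'"'"'s default log, each requested URL shows up on a line like
          #   --2011-01-01 00:00:00--  http://example.com/foo-p5.html
          # so the URL is the third whitespace-separated field.
          /^--/ { urls[$3] = 1 }
          END {
            for (u in urls)
              if (u ~ /p[4-9]\.html$/) {
                sub(/^https?:\/\/[^\/]+/, "", u)   # keep only the path part
                print "Disallow: " u
              }
          }'

Once you're happy with the list it produces, redirect the output into robots.txt and add your User-agent line.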

There's a good overview of AWK on Wikipedia: http://en.wikipedia.org/wiki/AWK

Nick