I'm trying to find any dead links on a website using wget. I'm running:

wget -r -l 20 -e robots=off --spider -S http://www.example.com

which recursively crawls the site, checking that each linked page exists and retrieving its headers. I then parse the output with a simple script.
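For what it's worth, the parsing is trivial. A minimal sketch of it (assuming English-locale wget output, and that with -S the raw status line is printed indented by two spaces; the exact log format may differ between wget versions):

wget -r -l 20 -e robots=off --spider -S http://www.example.com -o spider.log

awk '
  /^--/                  { url = $3 }    # request lines look like: --TIMESTAMP--  URL
  /^  HTTP\/[0-9.]+ 404/ { print url }   # with -S, the raw status line is indented
' spider.log

Newer versions of wget also print a "Found N broken links." summary at the end of a recursive spider run, but as far as I can tell that summary doesn't name the referring pages either.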

I would like to know which page wget followed to reach a given link. However, the only information wget outputs is the URL it's requesting, the response headers, and a timestamp (plus some other details I don't care about). That is enough to tell that a dead link exists, but not which page the dead link appears on.

Is there any way to make wget output that information (short of having it actually download the entire site)?
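The closest I've come is wget's --debug flag: during a recursive crawl wget sends a Referer header with each request, and the debug log dumps the outgoing requests, so in principle the referring page could be scraped out of it. A rough sketch of that idea (untested; the debug output format likely varies between versions and builds):

# --debug dumps each outgoing request, including the Referer header
# wget sets during recursive retrieval; -o sends the log to a file.
wget -r -l 20 -e robots=off --spider --debug -o debug.log http://www.example.com

# Pair each requested path with the Referer line from its request dump.
grep -E '^(GET |Referer: )' debug.log

The debug log is extremely verbose, though, so I'm hoping there's a cleaner option.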