views: 559

answers: 5

I know the Google Search Appliance has access to this information (since it factors into the PageRank algorithm), but is there a way to export it from the crawler appliance?

External tools won't work because a significant portion of the content is on a corporate intranet.

+3  A: 

There might be something available from Google, but I have never checked. I usually use the link checker provided by the W3C. The W3C checker can also detect redirects, which is useful if your server handles 404s by redirecting instead of returning a 404 status code.
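
If you prefer to run the check from the command line (convenient inside an intranet), the W3C checker is also distributed on CPAN as W3C::LinkChecker, which installs a checklink script. A minimal sketch; the host name is a placeholder and flag names can vary between versions, so confirm them with checklink --help:

    # Assumes the CPAN W3C::LinkChecker distribution is installed (provides `checklink`).
    cpan W3C::LinkChecker

    # Crawl two levels deep from the start page; the report lists broken links
    # and redirects, with --summary keeping the output short.
    checklink --summary --recursive --depth 2 http://intranet.example.com/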

regex
+1  A: 

You can use Google Webmaster Tools to view, among other things, broken links on your site.

This won't show you broken links to external sites though.

Greg
A: 

Why not just analyze your web server logs and look for all the 404s? That makes far more sense and is much more reliable.

TravisO
We're moving around 700,000 pages to a new CMS, and the server logs only catch the ones people are actively clicking on.
Chris Ballance
The logs will also show 404s from clients such as the GSA crawler. If the GSA has detected a given URL, then so has the server.
Liam
A: 

It seems that this is not possible. Under Status and Reports > Crawl Diagnostics there are two styles of report available: the directory drill-down 'Tree View' and the 100-URLs-at-a-time 'List View'. Some people have tried writing programs to page through the List View, but this seems to fail after a few thousand URLs.

My advice is to use your server logs instead. Make sure that 404 and referrer URL logging are enabled on your web server, since you will probably want to correct the page containing the broken link.

You could then use a log file analyser to generate a broken link report.

To create an effective, long-term way of monitoring your broken links, you may want to set up a cron job to do the following (a sketch of the pipeline appears after the list):

  • Use grep to extract lines containing 404 entries from the server log file.
  • Use sed to remove everything except requested URLs and referrer URLs from every line.
  • Use sort and uniq commands to remove duplicates from the list.
  • Output the result to a new file each time so that you can monitor changes over time.
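
A minimal sketch of that pipeline, assuming an Apache-style combined log format and the log path shown below (both are assumptions to adjust for your server):

    #!/bin/sh
    # Pull broken-link data out of the web server access log.
    LOG=/var/log/apache2/access.log
    OUT=broken-links-$(date +%Y%m%d).txt

    # 1. grep: cheap prefilter for lines that mention a 404.
    # 2. sed:  keep only "requested-URL referrer-URL", and only for lines whose
    #          status field really is 404 (-n ... p drops everything else).
    # 3. sort | uniq: remove duplicates.
    # A new dated file is written each run so changes can be tracked over time.
    grep ' 404 ' "$LOG" \
      | sed -nE 's/^.*"[A-Z]+ ([^ ]+)[^"]*" 404 [^ ]+ "([^"]*)".*$/\1 \2/p' \
      | sort | uniq > "$OUT"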
Liam
+1  A: 

A free tool called Xenu turned out to be the weapon of choice for this task. http://home.snafu.de/tilman/xenulink.html#Download

Chris Ballance