views: 908
answers: 8

What techniques or tools are recommended for finding broken links on a website?

I have access to the log files, so I could conceivably parse these looking for 404 errors, but I would like something automated which will follow (or attempt to follow) all the links on a site.

+6  A: 

See LinkChecker for Firefox.

For Mac OS there is Integrity, a tool which can check URLs for broken links.

For Windows there is Xenu's Link Sleuth.

jrudolph
Xenu's Link Sleuth looks just the ticket.
Ian Nelson
A: 

Hi Ian,

The best way is to create a small bot that runs over your entire site and records the outcome. I did this to test my sites before deployment, and it works really well.

Nick Berardi
@nick, can you please share the small bot, or the algorithm, so I can try it myself on my site? Thanks in advance.
harigm
+2  A: 

Either use a tool that parses your log files and gives you a 'broken links' report (e.g. Analog or Google Webmaster Tools), or run a tool that spiders your web site and reports broken links (e.g. W3C Link Checker).

Peter Hilton
+2  A: 

I like the W3C Link Checker.

Paul Reiners
A: 

See the linkchecker tool.

A: 

Your best bet is to knock together your own spider in your scripting language of choice. It could be done recursively along the lines of:

// Pseudo-code to recursively check for broken links,
// logging all errors centrally
function check_links($page)
{
    // Skip pages we've already checked, so circular links
    // don't cause infinite recursion
    if(already_visited($page))
    {
        return;
    }
    mark_visited($page);

    $html = fetch_page($page);
    if(!$html)
    {
        // Log page to failures log
        ...
    }
    else
    {
        // Find all html, img, etc links on page
        // (in practice, only recurse into links on your own domain)
        $links = find_links_on_page($html);
        foreach($links as $link)
        {
            check_links($link);
        }
    }
}

Once your site has gotten a certain level of attention from Google, their Webmaster Tools are invaluable in showing broken links that users may come across, but this is quite reactive: the dead links may be around for several weeks before Google indexes them and logs the 404 in your webmaster panel.

Writing your own script like the one above will show you all possible broken links, without having to wait for Google (Webmaster Tools) or your users (404s in the access logs) to stumble across them.

ConroyP
I no longer have time for such intellectual challenges (fun though it sounds), and was kinda hoping someone might have written such a spider already! :-)
Ian Nelson
If someone ever writes a handy pseudo-code -> PHP/Perl converter, then we'd be in business!
ConroyP
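
For anyone wanting a concrete starting point, here is a rough, runnable take on the pseudo-code above, written in C# (the same language as the crawler linked in a later answer). It is only a sketch under assumptions: the class name, the example start URL and the regex-based href extraction are invented for illustration, and a real spider would add an HTML parser, politeness delays and a robots.txt check.

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Illustrative sketch of a recursive broken-link checker.
class LinkCheckerSketch
{
    // Pages/links already checked, so circular links don't loop forever
    static readonly HashSet<string> Visited = new HashSet<string>();

    // Crude href extraction; a real crawler would use an HTML parser instead
    static readonly Regex HrefPattern =
        new Regex(@"href\s*=\s*[""']([^""'#]+)", RegexOptions.IgnoreCase);

    static void Main()
    {
        var root = new Uri("http://www.example.com/");   // hypothetical start URL
        CheckLinks(root, root);
    }

    static void CheckLinks(Uri page, Uri root)
    {
        if (!Visited.Add(page.AbsoluteUri))
            return;                                      // already checked

        string html = Fetch(page);
        if (html == null)
            return;                                      // failure already logged

        // Check every link on the page, but only crawl deeper into our own host
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri link;
            if (!Uri.TryCreate(page, m.Groups[1].Value, out link))
                continue;                                // malformed URL, skip
            if (link.Scheme != Uri.UriSchemeHttp && link.Scheme != Uri.UriSchemeHttps)
                continue;                                // ignore mailto:, javascript:, etc.

            if (link.Host == root.Host)
                CheckLinks(link, root);                  // internal: recurse
            else if (Visited.Add(link.AbsoluteUri))
                Fetch(link);                             // external: just check once
        }
    }

    // Returns the page body, or null (after logging) if the request failed
    static string Fetch(Uri url)
    {
        try
        {
            using (var client = new WebClient())
                return client.DownloadString(url);
        }
        catch (WebException ex)
        {
            Console.WriteLine("BROKEN: {0} ({1})", url, ex.Message);
            return null;
        }
    }
}

Recursing only into same-host pages keeps it from wandering off across the whole web, while external links still get fetched once so broken outbound links are reported too.
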
A: 

There's a Windows app called CheckWeb. It's no longer developed, but it works well, and the code is open (C++, I believe).

You just give it a URL and it will crawl your site (and external links if you choose), reporting any errors, image/page "weight", etc.

http://www.algonet.se/~hubbabub/how-to/checkweben.html

scunliffe
+1  A: 

In a .NET application you can set IIS to pass all requests to ASP.NET and then in your global error handler you can catch and log 404 errors. This is something you'd do in addition to spidering your site to check for internal missing links. Doing this can help find broken links from OTHER sites and you can then fix them with 301 redirects to the correct page.
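
A minimal sketch of what that global handler might look like in classic ASP.NET, assuming IIS is already routing all requests through ASP.NET; the Global.asax class shown and the App_Data log path are placeholders, not a prescribed setup.

// Global.asax.cs -- sketch only; the log path and format are placeholders
using System;
using System.IO;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_Error(object sender, EventArgs e)
    {
        // A missing page surfaces here as an HttpException with status 404
        var httpError = Server.GetLastError() as HttpException;
        if (httpError == null || httpError.GetHttpCode() != 404)
            return;

        // Log what was requested and, crucially, who linked to it
        string referrer = Request.UrlReferrer == null ? "-" : Request.UrlReferrer.ToString();
        string line = string.Format("{0:u}\t404\t{1}\t{2}", DateTime.UtcNow, Request.RawUrl, referrer);
        File.AppendAllText(Server.MapPath("~/App_Data/404.log"), line + Environment.NewLine);

        // From here you could also Server.ClearError() and issue a 301
        // redirect to the correct page for known-moved URLs.
    }
}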

To help test your site internally there's also the Microsoft SEO toolkit.

Of course the best technique is to avoid the problem at compile time! In ASP.NET you can get close to this by requiring that all links be generated from static methods on each page so there's only ever one location where any given URL is generated. e.g. http://www.codeproject.com/KB/aspnet/StronglyTypedPages.aspx
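
As a tiny hypothetical illustration of that idea, each page can expose a single static method that is the only place its URL is ever built, so a moved page means one change and the compiler finds every caller:

// Hypothetical example: the one and only place a product-details URL is built
public static class ProductPage
{
    public static string Url(int productId)
    {
        return "~/Products/Details.aspx?id=" + productId;
    }
}

// Elsewhere in the site, links are always generated through it:
//   link.NavigateUrl = ProductPage.Url(42);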

If you want a complete C# crawler, there's one here: http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/

Hightechrider