views:

58

answers:

4

I have a directory with > 1000 .html files and would like to check all of them for bad links, preferably from the console. Can you recommend any tool for such a task?

+1  A: 

You can extract links from HTML files using the Lynx text browser. Bash scripting around this should not be difficult.
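
A minimal sketch of what that could look like (the directory path is a placeholder, and lynx's -listonly output format may differ slightly between versions):

#!/bin/bash
# Extract the links from every .html file with lynx, keep the external
# http(s) ones, and probe each with wget --spider (check, don't download).
# Relative links come out as file:// URLs and would need a separate check.
for f in /path/to/dir/*.html; do
    lynx -dump -listonly "$f" |
    awk '/^ *[0-9]+\. /{print $2}' |   # strip lynx's numbering
    grep -E '^https?://' |
    sort -u |
    while read -r url; do
        wget -q --spider "$url" || echo "BROKEN: $url (in $f)"
    done
done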

mouviciel
Lynx can extract the links, but it doesn't really support link checking itself. wget is much better suited for the purpose.
reinierpost
How do you get wget to output a list of links in a page?
David Dorward
It's a really cool idea. Why didn't I think of it earlier?
depesz
+3  A: 

I'd use checklink (a W3C project)
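
For a single page, a minimal invocation looks like this (the URL is a placeholder, and the local files would have to be served over HTTP; see checklink's help for recursion and output options):

checklink http://example.com/index.html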

David Dorward
As long as you are careful to set the user agent and accept headers (to avoid bogus error codes from bot detectors), this should work.
Tim Post
It looks OK, but it's definitely not intended for such large projects: there's no way to list just the broken links, and the output for my project is *really* big.
depesz
A: 

Try the webgrep command line tools or, if you're comfortable with Perl, the HTML::TagReader module by the same author.

gareth_bowles
+1  A: 

You can use wget, e.g.:

wget -r --spider  -o output.log http://somedomain.com

At the bottom of output.log, wget will indicate whether it has found broken links. You can parse that with awk/grep; a rough sketch follows below.
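
A rough sketch of the parsing step; the exact wording of wget's log messages varies between versions:

# lines where --spider flagged a broken link
grep -i 'broken link' output.log

# URLs that answered 404 (the URL appears on the "--...--" request line
# a couple of lines above the status line)
grep -B 2 ' 404 ' output.log | grep -o 'http[^ ]*' | sort -u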

ghostdog74