I have a directory with > 1000 .html files and would like to check all of them for bad links - preferably from the console. Is there any tool you can recommend for such a task?
A:
You can extract links from HTML files using the Lynx text browser. Bash scripting around this should not be difficult.
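For example, a minimal sketch (not part of the original answer; it assumes lynx supports -dump -listonly, that curl is available, and that the awk pattern matches lynx's numbered reference list):

for f in *.html; do
  # lynx -dump -listonly prints a numbered list of references; pull out the URLs
  lynx -dump -listonly "$f" | awk '/^ *[0-9]+\./ {print $2}' | sort -u |
  while read -r url; do
    case "$url" in
      http*)
        # fetch only the status code; treat anything >= 400 as a bad link
        code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
        [ "$code" -ge 400 ] && echo "$f: $url -> $code"
        ;;
    esac
  done
done

Local (relative) links come back as file:// references from lynx, so the http* filter above only probes external links.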
mouviciel
2010-03-15 10:14:52
Lynx can do it, but it isn't really designed for it. wget is much better suited for the purpose.
reinierpost
2010-03-15 11:18:06
How do you get wget to output a list of links in a page?
David Dorward
2010-03-15 11:27:57
It's a really cool idea. Why didn't I think of it earlier?
depesz
2010-03-15 13:14:30
As long as you are careful to set the user agent and Accept headers (to avoid bogus error codes from bot detectors), this should work.
Tim Post
2010-03-15 11:41:30
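For illustration (the header values below are just placeholders, not a recommendation of a specific user-agent string), that could look like:

wget --spider \
     --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
     --header="Accept: text/html,application/xhtml+xml" \
     http://somedomain.com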
It would look OK, but it's definitely not intended for such large projects - it doesn't have any way to just list broken links, and the output for my project is *really* big.
depesz
2010-03-15 13:25:15
A:
Try the webgrep command line tools or, if you're comfortable with Perl, the HTML::TagReader module by the same author.
gareth_bowles
2010-03-15 15:55:09
A:
You can use wget, e.g.:
wget -r --spider -o output.log http://somedomain.com
At the bottom of the output.log file, wget will indicate whether it has found broken links. You can parse that using awk/grep.
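As a rough sketch of that parsing step (the exact log wording varies between wget versions, so treat these patterns as assumptions to adapt):

# show the lines wget logs for failing requests
grep -i 'broken link\|404' output.log
# or list just the URLs printed after the "Found N broken links." summary
awk '/Found .* broken link/ {flag=1; next} flag && /^http/ {print}' output.log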
ghostdog74
2010-03-15 16:04:02