ansaurus

Question

How to find all links / pages on a website

Answer 1

+1 A:

Check out linkchecker—it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script up a solution for creating the directory tree.
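For the "script up a solution" step, here is a minimal sketch of turning a list of crawled URLs into a directory tree. It assumes you have already pulled the URLs out of the report into a plain list of strings; the class name `UrlTree`, its methods, and the example.com URLs are all made up for illustration and are not part of linkchecker.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UrlTree {
    // One node per path segment; TreeMap keeps the children sorted by name.
    public static final class Node {
        final Map<String, Node> children = new TreeMap<>();
    }

    // Builds a tree from the path component of each URL.
    // Assumption: every string is a syntactically valid URL
    // (URI.create throws IllegalArgumentException otherwise).
    public static Node build(List<String> urls) {
        Node root = new Node();
        for (String url : urls) {
            String path = URI.create(url).getPath();
            if (path == null) {
                continue; // opaque URIs such as mailto: have no path
            }
            Node current = root;
            for (String segment : path.split("/")) {
                if (segment.isEmpty()) {
                    continue; // skip the empty piece before the leading slash
                }
                current = current.children.computeIfAbsent(segment, k -> new Node());
            }
        }
        return root;
    }

    // Renders the tree as indented text, two spaces per level.
    public static String render(Node node, String indent) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Node> entry : node.children.entrySet()) {
            sb.append(indent).append(entry.getKey()).append('\n');
            sb.append(render(entry.getValue(), indent + "  "));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical URLs standing in for whatever the crawl reported.
        List<String> urls = Arrays.asList(
            "http://example.com/",
            "http://example.com/docs/a.html",
            "http://example.com/docs/b.html",
            "http://example.com/about.html");
        System.out.print(render(build(urls), ""));
    }
}
```

Run on the sample list, this prints `about.html`, then `docs` with `a.html` and `b.html` indented beneath it. Query strings and fragments are ignored here; whether that is what you want depends on the site.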

Hank Gay 2009-09-17 14:51:42

thank you so much Hank! Perfect - exactly what I needed. Very much appreciated.

Jonathan Lyon 2009-09-17 15:08:05

Answer 2

A:

*NOTE: Just Google "simple web crawler in language_name", substituting the language you are using.*

TheMachineCharmer 2009-09-17 14:52:56

Answer 3

A:

If this is a programming question, then I would suggest you write your own regular expression to parse all the retrieved contents. The target tags are IMG and A for standard HTML. In Java:

// Matches the start of an A or IMG tag up to the opening quote of its href/src value.
final String openingTags = "(<a [^>]*href=['\"]?|<img [^>]*src=['\"]?)";

This, along with the Pattern and Matcher classes, should detect the beginning of the tags. Add the LINK tag if you also want CSS.
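As a rough sketch of how that fits together with Pattern and Matcher: the version below extends the opening-tag pattern with a second capture group for the URL itself and matches case-insensitively. The class name `LinkExtractor`, the method `extractLinks`, and the sample HTML are my own placeholders, and this is only an illustration, not a robust HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Group 1: the opening of an A or IMG tag up to the attribute value.
    // Group 2: the attribute value (the link), ending at a quote, whitespace, or '>'.
    private static final Pattern LINK = Pattern.compile(
        "(<a\\s[^>]*?href\\s*=\\s*['\"]?|<img\\s[^>]*?src\\s*=\\s*['\"]?)([^'\"\\s>]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns every href/src value found, in document order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical input; in practice this would be the fetched page source.
        String html = "<a href=\"/about.html\">About</a> "
                    + "<IMG alt='logo' src='/img/logo.png'>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

The links come back exactly as written in the page, so relative ones still need resolving against the page URL before you can crawl them.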

However, it is not as easy as you may have initially thought. Many web pages are not well-formed. Programmatically extracting every link that a human being can "recognize" is really difficult if you need to take all the irregularities into account.

Good luck!

mizubasho 2009-09-17 15:17:47
