ansaurus

Question

How do you parse a web page and extract all the href links?

Answer 1

+5 A:

A quick Google search turned up a nice-looking possibility, TagSoup.

William Keller 2008-09-19 03:28:55

This site provides a complete, working example with TagSoup: http://www.cyblex.at/blog/?p=83

I had to change some of the quote marks (' and ") to get it to run, but this example is excellent. The author downloads all the *.mp4 files.

2008-10-03 19:49:08

Answer 2

A:

It depends on which languages you know. In Java I use Apache Commons' HTTP Parser (along with their HTTPClient).

I'm sure that there is a widely used HTML parser for this in any language that you are developing in.

Zombies 2008-09-19 03:31:06

Answer 3

A:

Try a regular expression. Something like this should work:

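// The first closure argument is the whole match; the capture groups follow it.
(html =~ /<a[^>]*href=['"]([^'"]*)['"][^>]*>(.*?)<\/a>/).each { match, url, text ->
    // do something with url and text
}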

Take a look at "Groovy - Tutorial 4 - Regular expressions basics" and "Anchor Tag Regular Expression Breaking".
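
For anyone doing this in plain Java rather than Groovy, here is a minimal sketch of the same regex idea using java.util.regex. The class name RegexLinkExtractor and the method extractLinks are made up for illustration, and the pattern is only an approximation: it accepts single or double quotes around the href value, but it can still be fooled by unusual markup (unquoted attributes, commented-out tags, nested tags inside the link).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
public class RegexLinkExtractor {

    // Group 1 is the quote character, group 2 the href value, group 3 the link text.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*(['\"])(.*?)\\1[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one {url, text} pair per anchor tag the pattern recognises.
    public static List<String[]> extractLinks(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(2), m.group(3) });
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href='1.html'>One</a> <a class=\"x\" href=\"2.html\">Two</a>";
        for (String[] link : extractLinks(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

This is fine for quick scraping of pages whose markup you control. For arbitrary pages, a real parser is usually the safer choice.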

J D OConal 2008-09-19 03:41:42

Regular Expressions also cure cancer.

wfarr 2008-09-19 03:50:25

Answer 4

+1 A:

Use XmlSlurper to parse the HTML as an XML document, then use the find method (or findAll, to get every match) with an appropriate closure to select the a tags, and then use the list method on GPathResult to get a list of the tags. You should then be able to extract the text as children of the GPathResult.
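
The same parse-then-select idea can be sketched in plain Java with the JDK's built-in DOM parser, which may help if Groovy is not available. This is only an illustration under the assumption that the input is well-formed XHTML; the class name DomLinkLister and the method listLinks are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
public class DomLinkLister {

    // Parses well-formed XHTML and returns "text, href" for every a element.
    public static List<String> listLinks(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // Collects a elements at any depth, in document order.
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                result.add(a.getTextContent() + ", " + a.getAttribute("href"));
            }
            return result;
        } catch (Exception e) {
            // Malformed input ends up here; real HTML often is not valid XML.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"http://www.google.com/\">Google</a></body></html>";
        listLinks(page).forEach(System.out::println);
    }
}
```

Real-world pages are frequently not well-formed, in which case a tolerant parser such as TagSoup would have to clean the markup first.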

Peter Kelley 2008-09-19 03:52:34

Answer 5

+2 A:

I don't know Java, but I think that XPath is far better than classic regular expressions for getting one (or more) HTML elements.

It is also easier to write and to read.

<html>
   <body>
      <a href="1.html">1</a>
      <a href="2.html">2</a>
      <a href="3.html">3</a>
   </body>
</html>

With the HTML above, the expression "/html/body/a" selects all of the a elements, and "/html/body/a/@href" selects their href attributes.

Here's a good step by step tutorial http://www.zvon.org/xxl/XPathTutorial/General/examples.html
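
For readers working in Java, a rough sketch of the XPath approach using the JDK's javax.xml.xpath package could look like the following. The class name XPathLinkExtractor and the method hrefs are made up for the example, and it assumes the page is well-formed XHTML without a default namespace. The expression "/html/body/a/@href" selects the href attributes themselves rather than the a elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
public class XPathLinkExtractor {

    // Returns the href value of every a element directly under /html/body.
    public static List<String> hrefs(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                "/html/body/a/@href",
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (XPathExpressionException e) {
            // Thrown for a bad expression or for input that cannot be parsed as XML.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
            + "<a href=\"1.html\">1</a><a href=\"2.html\">2</a><a href=\"3.html\">3</a>"
            + "</body></html>";
        hrefs(page).forEach(System.out::println);
    }
}
```

To pick up links at any depth instead of only direct children of body, the expression "//a/@href" can be used in place of the absolute path.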

Anonymous 2008-09-19 07:14:56

Answer 6

+2 A:

Assuming well-formed XHTML, slurp the xml, collect up all the tags, find the 'a' tags, and print out the href and text.

input = """<html><body>
<a href = "http://www.hjsoft.com/">John</a>
<a href = "http://www.google.com/">Google</a>
<a href = "http://www.stackoverflow.com/">StackOverflow</a>
</body></html>"""

doc = new XmlSlurper().parseText(input)
doc.depthFirst().collect { it }.findAll { it.name() == "a" }.each {
    println "${it.text()}, ${it.@href.text()}"
}

John Flinchbaugh 2008-10-02 18:18:55

Answer 7

A:

HTML parser + regular expressions. Any language would do it, though I'd say Perl is the fastest solution.

Prog 2008-10-02 18:34:04
