I want to parse a web page in Groovy and extract all of the href links along with their associated text.

If the page contained these links:

<a href="">Google</a>
<a href="">Apple</a>

The output would be:

I'm looking for a Groovy answer, a.k.a. the easy way!

A quick Google search turned up a nice-looking possibility: TagSoup.

This site provides a complete, working example with TagSoup. I had to change some of the quote marks (' and ") to get it to run, but the example is excellent. The author downloads all the *.mp4 files.
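For readers who cannot reach that example, here is a minimal sketch of the usual TagSoup recipe: hand a TagSoup parser to XmlSlurper so that sloppy HTML is cleaned up before it is walked. This is not the code from the linked site. The Maven coordinate and version in `@Grab`, the sample markup, and the example.com URLs are all assumptions made for illustration. The `groovy.xml.XmlSlurper` import assumes Groovy 3 or later; on older versions the class lives in `groovy.util` and needs no import.

```groovy
// Assumed dependency coordinate; adjust the version to whatever you actually use.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
import groovy.xml.XmlSlurper

// Hypothetical sample page: unclosed <p> and <br>, no closing </body></html>.
def html = '''<html><body>
<p>Some links:
<br>
<a href="http://www.example.com/one">First</a>
<a href="http://www.example.com/two">Second</a>
'''

// TagSoup repairs the markup, so XmlSlurper sees a well-formed tree.
def slurper = new XmlSlurper(new Parser())
def page = slurper.parseText(html)

// Walk every node, keep the anchors, and pull out href and text.
def links = page.depthFirst().findAll { it.name() == 'a' }.collect {
    [href: it.@href.text(), text: it.text()]
}

links.each { println "${it.text}, ${it.href}" }
```

To parse a live page instead of a string, `slurper.parse(someUrlString)` should work the same way, since XmlSlurper accepts a URI string.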

It depends on which languages you know. In Java I use Apache Commons' HTTP Parser (along with their HttpClient).

I'm sure that there is a widely used HTML parser for this in any language that you are developing in.


Try a regular expression. Something like this should work:

(html =~ /<a[^>]*href=["'](.*?)["'][^>]*>(.*?)<\/a>/).each { match, url, text ->
    // match is the whole tag; do something with url and text
}

Take a look at Groovy - Tutorial 4 - Regular expressions basics and Anchor Tag Regular Expression Breaking.
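To make the regular-expression approach concrete, here is a small self-contained sketch run against a sample string. The sample markup and the example.com URLs are assumptions made for illustration, and the pattern is deliberately simple: it will not cope with anchors that contain nested tags spanning several lines or with unquoted attribute values.

```groovy
// Hypothetical sample input; the URLs are placeholders.
def html = '''<p>
<a href="http://www.example.com/one">First</a>
<a class="x" href='http://www.example.com/two'>Second</a>
</p>'''

// Group 1 captures the href value, group 2 captures the link text.
def pattern = /<a\b[^>]*href\s*=\s*["']([^"']*)["'][^>]*>(.*?)<\/a>/

def links = []
// Each match is passed as (whole match, group 1, group 2).
(html =~ pattern).each { match, url, text ->
    links << [url: url, text: text]
}

links.each { println "${it.text}, ${it.url}" }
```

Note that the closure takes three parameters, not two: Groovy passes the full match first, followed by the captured groups.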

Use XmlSlurper to parse the HTML as an XML document, then use the find method with an appropriate closure to select the a tags, and call the list method on the resulting GPathResult to get a list of the tags. You should then be able to extract the text as children of the GPathResult.

I don't know Java, but I think XPath is far better than classic regular expressions for selecting one or more HTML elements.

It is also easier to write and to read.

      <a href="1.html">1</a>
      <a href="2.html">2</a>
      <a href="3.html">3</a>

With the HTML above, the expression "/html/body/a" will select all the a elements (use "/html/body/a/@href" to select just the href attributes).

Here's a good step-by-step tutorial
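Since the question asks for Groovy, here is a rough sketch of how the "/html/body/a" expression could be evaluated from a Groovy script using the XPath API that ships with the JDK. The surrounding html and body tags in the sample are assumed (the snippet above shows only the links), and this only works when the page is well-formed XML.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.xpath.XPathConstants
import javax.xml.xpath.XPathFactory

// Well-formed sample page wrapping the three links shown above.
def page = '''<html><body>
  <a href="1.html">1</a>
  <a href="2.html">2</a>
  <a href="3.html">3</a>
</body></html>'''

// Parse into a W3C DOM document; malformed HTML would throw here.
def builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
def doc = builder.parse(new ByteArrayInputStream(page.getBytes('UTF-8')))

// Evaluate the expression and request the matching nodes as a NodeList.
def xpath = XPathFactory.newInstance().newXPath()
def nodes = xpath.evaluate('/html/body/a', doc, XPathConstants.NODESET)

// Collect the href attribute and text content of each anchor element.
def links = (0..<nodes.length).collect { i ->
    def a = nodes.item(i)
    [href: a.getAttribute('href'), text: a.textContent]
}

links.each { println "${it.text}, ${it.href}" }
```

For real-world pages that are not valid XML, the DOM would have to come from a lenient HTML parser instead of `DocumentBuilderFactory`.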

Assuming well-formed XHTML, slurp the xml, collect up all the tags, find the 'a' tags, and print out the href and text.

input = """<html><body>
<a href = "">John</a>
<a href = "">Google</a>
<a href = "">StackOverflow</a>
</body></html>"""

doc = new XmlSlurper().parseText(input)
// Walk every node, keep only the <a> elements, and print text and href.
doc.depthFirst().collect { it }.findAll { it.name() == "a" }.each {
    println "${it.text()}, ${it.@href.text()}"
}

HTML parser + regular expressions. Any language would do it, though I'd say Perl is the fastest solution.