ansaurus

Question

Answer 1

+1 A:

You can use the HTMLParser module.

The code would probably look something like this:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
           # Check the list of defined attributes.
           for name, value in attrs:
               # If href is defined, print it.
               if name == "href":
                   print name, "=", value


parser = MyHTMLParser()
parser.feed(your_html_string)

Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.

Stephen 2010-06-19 13:02:24

Answer 2

+3 A:

Try with Beautifulsoup:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')

In case you just want links starting with http://, you should use:

soup.findAll('a', attrs={'href': re.compile("^http://")})

systempuntoout 2010-06-19 13:04:10

Answer 3

A:

Look at using the beautiful soup html parsing library.

http://www.crummy.com/software/BeautifulSoup/

You will do something like this:

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print a.get("href")

Peter Lyons 2010-06-19 13:07:17

ansaurus

tags:

views:

answers:

how can I get href links from html code

related questions