ansaurus

Question

Getting the value of href attributes in all <a> tags in an HTML file with Python

Answer 1

+3 A:

There's an HTML parser that comes standard with Python. Check out htmllib.

eduffy 2009-03-22 17:28:39

htmllib was removed in Python 3.0 (it has been deprecated since 2.6), so for the sake of future compatibility I would like to avoid it.

rogeriopvl 2009-03-22 17:44:50

Answer 2

+10 A:

Beautiful Soup can do this almost trivially:

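# BeautifulSoup 3 (the "BeautifulSoup" package)
from BeautifulSoup import BeautifulSoup as soup

html = soup('<body><a href="123">qwe</a><a href="456">asd</a></body>')
# Keep only <a> tags that actually carry an href attribute.
print([tag['href'] for tag in html.findAll('a', {'href': True})])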

Ignacio Vazquez-Abrams 2009-03-22 17:43:14

That does it perfectly. Thanks

rogeriopvl 2009-03-22 19:11:31

Answer 3

+1 A:

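Don't divide the HTML content into lines, as there may be multiple matches in a single line. Also, don't assume there are always quotes around the URL.

Do something like this:

import re

# content is assumed to hold the whole HTML document as one string
links = re.finditer(r' href="?([^\s">]+)', content)

for link in links:
    print(link.group(1))  # group(1) is the URL; link itself is a match object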

Jiayao Yu 2009-03-22 17:57:58
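
For illustration, here is a minimal sketch of that approach run against a made-up snippet (the `content` string and the variable names are assumptions, not from the original post). It shows the pattern picking up two links on one line and an unquoted value. Single-quoted values are not handled, which is one of the reasons the other answers recommend a parser.

```python
import re

# Hypothetical sample: two links on one line, one of them unquoted.
content = (
    '<p><a href="123">qwe</a> <a href=456>asd</a></p>\n'
    '<a class="x" href="789">zxc</a>'
)

# Optional opening quote, then everything up to whitespace, a quote or '>'.
pattern = re.compile(r'\shref="?([^\s">]+)')

urls = [m.group(1) for m in pattern.finditer(content)]
print(urls)
```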

Answer 4

+6 A:

Another alternative to BeautifulSoup is lxml (http://codespeak.net/lxml/):

import lxml.html
# parse() accepts a URL or filename; the XPath selects every href attribute on <a> elements
links = lxml.html.parse("http://stackoverflow.com/").xpath("//a/@href")
for link in links:
    print(link)

adw 2009-03-22 18:00:18

Answer 5

+4 A:

What others haven't told you is that using regular expressions for this is not a reliable solution.
Regular expressions will give you wrong results in many situations: if there are <a> tags that are commented out, if there is text in the page that includes the string "href=", if there are <textarea> elements with HTML code in them, and many others. Plus, the href attribute may exist on tags other than the anchor tag.

What you need for this is XPath, a query language for DOM trees: it lets you retrieve any set of nodes satisfying the conditions you specify (HTML attributes are nodes in the DOM).
XPath is a well-standardized language nowadays (W3C) and is well supported by all major languages. I strongly suggest you use XPath and not regexps for this.
adw's answer shows one example of using XPath for your particular case.

GetFree 2009-03-22 18:31:56
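
To make the failure cases concrete, here is a small sketch comparing a regex with a tree query. It uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and requires well-formed markup, so the sample document below is a made-up, well-formed fragment; for real-world HTML a dedicated HTML parser such as lxml would be the better fit.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample containing the traps described above.
DOC = (
    '<body>'
    '<!-- <a href="commented-out.html">old</a> -->'
    '<a href="real.html">real</a>'
    '<textarea>&lt;a href="in-textarea.html"&gt;</textarea>'
    '</body>'
)

# The regex sees every occurrence of href="...", wherever it appears.
regex_hits = re.findall(r'href="([^"]*)"', DOC)

# The tree query only sees real <a> elements that carry an href attribute.
root = ET.fromstring(DOC)
tree_hits = [a.get('href') for a in root.findall('.//a[@href]')]

print(regex_hits)
print(tree_hits)
```

The regex reports the commented-out link and the text inside the textarea as if they were links; the tree query reports only the real anchor.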

Answer 6

+2 A:

As previously mentioned: regex does not have the power to parse HTML. Do not use regex for parsing HTML. Do not pass Go. Do not collect £200.

Use an HTML parser.

But for completeness, the primary problem is:

re.match ('/href="(.*)"/iU', line)

You don't use the “/.../flags” syntax for decorating regexes in Python. Instead put the flags in a separate argument:

re.match('href="(.*)"', line, re.I|re.U)

Another problem is the greedy ‘.*’ pattern. If you have two hrefs in a line, it'll happily suck up all the content between the opening " of the first match and the closing " of the second match. You can use the non-greedy ‘.*?’ or, more simply, ‘[^"]*’ to only match up to the first closing quote.

But don't use regexes for parsing HTML. Really.

bobince 2009-03-23 00:14:53
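
Purely to illustrate the regex mechanics described above (not as a recommendation to parse HTML this way), here is a short sketch on a made-up line. The sample string and variable names are assumptions. It also shows a further pitfall: re.match only matches at the start of the string, so re.search or re.findall is needed to find an href in the middle of a line.

```python
import re

# Hypothetical line with two links, one in upper case.
line = '<a href="123">qwe</a> <A HREF="456">asd</A>'

# Greedy: .* runs to the last double quote on the line.
greedy = re.search(r'href="(.*)"', line, re.I).group(1)

# Non-greedy and negated-class versions both stop at the first closing quote.
lazy = re.findall(r'href="(.*?)"', line, re.I)
negated = re.findall(r'href="([^"]*)"', line, re.I)

# re.match anchors at position 0, so it finds nothing here.
at_start = re.match(r'href="([^"]*)"', line, re.I)

print(greedy)
print(lazy, negated, at_start)
```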

Answer 7

+1 A:

Well, just for completeness I will add here what I found to be the best answer. I found it in the book Dive Into Python, by Mark Pilgrim.

Here is the code to list all URLs from a webpage:

# urllister.py (Python 2 only: sgmllib was removed in Python 3)
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        # Called by SGMLParser.__init__; start with an empty URL list.
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Called for every <a> start tag; attrs is a list of (name, value) pairs.
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

# In a separate script, with the class above saved as urllister.py:
import urllib, urllister
usock = urllib.urlopen("http://diveintopython.org/")
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls:
    print(url)

Thanks for all the replies.

rogeriopvl 2009-03-23 07:54:07
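
Since sgmllib and htmllib do not exist in Python 3, the same idea can be sketched with the standard library's html.parser module instead. This is a rough Python 3 equivalent, not the book's code: the class name HrefLister and the sample string are made up for illustration, and it feeds a literal string rather than fetching a page.

```python
from html.parser import HTMLParser


class HrefLister(HTMLParser):
    """Collect href values from <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag == 'a':
            self.urls.extend(
                value for name, value in attrs
                if name == 'href' and value is not None
            )


# Made-up sample: mixed case, plus an anchor without an href.
sample = (
    '<body><a href="123">qwe</a>'
    '<A HREF="456">asd</A>'
    '<a name="top">no href</a></body>'
)

parser = HrefLister()
parser.feed(sample)
parser.close()
urls = parser.urls
print(urls)
```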
