I'm building an app in Python, and I need to get the URLs of all the links in a webpage. I already have a function that uses urllib to download the HTML file from the web and turn it into a list of strings with readlines().

Currently I have this code, which uses a regex (I'm not very good at them) to search for links in every line:

for line in lines:
    result = re.match ('/href="(.*)"/iU', line)
    print result

This is not working: it only prints "None" for every line in the file, but I'm sure there are at least 3 links in the file I'm opening.

Can someone give me a hint on this?

Thanks in advance

+3  A: 

There's an HTML parser that comes standard in Python. Check out htmllib.

eduffy
htmllib is deprecated and gone as of Python 3.0, so for the sake of future compatibility, I would like to avoid it.
rogeriopvl
+10  A: 

Beautiful Soup can do this almost trivially:

from BeautifulSoup import BeautifulSoup as soup

html = soup('<body><a href="123">qwe</a><a href="456">asd</a></body>')
print [tag.attrMap['href'] for tag in html.findAll('a', {'href': True})]
Ignacio Vazquez-Abrams
That does it perfectly. Thanks
rogeriopvl
+1  A: 

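Don't divide the HTML content into lines, as there may be multiple matches in a single line. Also, don't assume there are always quotes around the URL.

Do something like this:

# The quote is optional; the URL ends at whitespace, a quote, or the closing '>'.
links = re.finditer(r' href="?([^\s">]+)', content)

for link in links:
  # finditer yields match objects; group(1) is the captured URL.
  print link.group(1)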
Jiayao Yu
+6  A: 

Another alternative to BeautifulSoup is lxml (http://codespeak.net/lxml/):

import lxml.html
links = lxml.html.parse("http://stackoverflow.com/").xpath("//a/@href")
for link in links:
    print link
adw
+4  A: 

What others haven't told you is that using regular expressions for this is not a reliable solution.
Regular expressions will give you wrong results in many situations: if there are <A> tags that are commented out, if there is text in the page that includes the string "href=", if there are <textarea> elements with HTML code in them, and many others. Plus, the href attribute may exist on tags other than the anchor tag.

What you need for this is XPath, which is a query language for DOM trees, i.e. it lets you retrieve any set of nodes satisfying the conditions you specify (HTML attributes are nodes in the DOM).
XPath is a well-standardized language nowadays (W3C), and it is well supported by all major languages. I strongly suggest you use XPath and not regexps for this.
adw's answer shows one example of using XPath for your particular case.
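
To make the difference concrete, here is a small sketch (Python 3 syntax) that runs both approaches over the same page. The sample markup is made up for illustration, and it uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath and needs well-formed XML, so it is not a substitute for lxml on real-world HTML:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (ElementTree needs valid XML).
PAGE = """<html><body>
<!-- <a href="commented-out.html">old link</a> -->
<p>Write href="not-a-link" inside the tag.</p>
<a href="first.html">first</a>
<a name="anchor-without-href">no link here</a>
<div><a href="second.html">second</a></div>
</body></html>"""

# Naive regex: it also picks up the commented-out link and the plain text.
regex_links = re.findall(r'href="([^"]*)"', PAGE)

# XPath subset: every <a> element, at any depth, that has an href attribute.
root = ET.fromstring(PAGE)
xpath_links = [a.get("href") for a in root.findall(".//a[@href]")]

print(regex_links)
print(xpath_links)
```

The regex reports four "links", two of which are not links at all, while the tree query returns only the two real anchors.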

GetFree
+2  A: 

As previously mentioned: regex does not have the power to parse HTML. Do not use regex for parsing HTML. Do not pass Go. Do not collect £200.

Use an HTML parser.

But for completeness, the primary problem is:

re.match ('/href="(.*)"/iU', line)

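You don't use the “/.../flags” syntax for decorating regexes in Python. Instead, put the flags in a separate argument. Also note that re.match only matches at the very start of the string, so to find an href anywhere in the line you need re.search:

re.search('href="(.*)"', line, re.I|re.U)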

Another problem is the greedy ‘.*’ pattern. If you have two hrefs in a line, it'll happily suck up all the content between the opening " of the first match and the closing " of the second match. You can use the non-greedy ‘.*?’ or, more simply, ‘[^"]*’ to only match up to the first closing quote.
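
A quick sketch of those differences, in Python 3 syntax and on a made-up line with two links (the variable names are only for illustration):

```python
import re

# Hypothetical line containing two links.
line = '<a href="one.html">1</a> <a href="two.html">2</a>'

# re.match anchors at the start of the string, so it finds nothing here.
at_start = re.match(r'href="(.*)"', line, re.I)

# Greedy: runs from the first opening quote to the last quote on the line.
greedy = re.search(r'href="(.*)"', line, re.I).group(1)

# Non-greedy and negated character class: both stop at the first closing quote.
lazy = re.findall(r'href="(.*?)"', line, re.I)
negated = re.findall(r'href="([^"]*)"', line, re.I)

print(at_start)
print(greedy)
print(lazy)
print(negated)
```

Here the greedy pattern returns everything between the two hrefs as a single "URL", while the other two patterns return both links separately.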

But don't use regexes for parsing HTML. Really.

bobince
+1  A: 

Well, just for completeness I will add here what I found to be the best answer. I found it in the book Dive Into Python, by Mark Pilgrim.

Here is the code to list all URLs from a webpage:

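# urllister.py
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        # reset() is called by SGMLParser.__init__, so the list starts empty.
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Called for every <a> tag; attrs is a list of (name, value) pairs.
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

# Usage, from a separate script (the class above is saved as urllister.py):
import urllib, urllister
usock = urllib.urlopen("http://diveintopython.org/")
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls: print url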
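
One caveat: sgmllib, like htmllib, is not available in Python 3. A rough equivalent using the standard library's html.parser might look like the sketch below. This is my own adaptation rather than code from the book, and the sample string stands in for the downloaded page:

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names arrive lowercased.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


# Hypothetical sample; in practice this would be the downloaded page text.
sample = '<body><a href="123">qwe</a><A HREF=456>asd</A><a name="x">no href</a></body>'
parser = URLLister()
parser.feed(sample)
parser.close()
print(parser.urls)
```

The parser copes with uppercase tags and unquoted attribute values, which is where the line-by-line regex approach tends to break.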

Thanks for all the replies.

rogeriopvl