views: 1287
answers: 4
How can I retrieve the links of a webpage and copy the URL addresses of the links using Python?

+7  A: 
import urllib2
import BeautifulSoup

# fetch the page; BeautifulSoup 3 accepts the file-like response directly
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)

# walk every anchor; get() avoids a KeyError on anchors that have no href
for a in soup.findAll('a'):
    if 'national-park' in a.get('href', ''):
        print 'found a url with national-park in the link'
Andrew Johnson
I think you should replace response with response.read()
Geo
This code is correct. Paste it into an interpreter.
Andrew Johnson
Sorry then :). I remember I was always using it with response.read().
Geo
+4  A: 

Here's a short snippet using the SoupStrainer class in BeautifulSoup:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

# fetch the page with httplib2
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# parse only the <a> tags, then print each one that carries an href attribute
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        print link['href']

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:

http://www.crummy.com/software/BeautifulSoup/documentation.html

Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory- and speed-wise) if you know in advance what you're parsing.

ars
+1, using the soup strainer is a great idea because it allows you to circumvent a lot of unnecessary parsing when all you're after are the links.
Evan Fosmark
I edited to add a similar explanation before I saw Evan's comment. Thanks for noting that, though!
ars
Thanks, this solved my problem; with this I finished my project. Thanks a lot.
NepUS
+1  A: 

Just for getting the links, without BeautifulSoup and regex:

import urllib2

# naive string splitting: break the page on closing </a> tags and slice out
# whatever sits between the <a href="..."> quotes; only works for simple markup
url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
data = page.read().split("</a>")
tag = "<a href=\""
endtag = "\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind + len(tag):]
            end = item.index(endtag)
        except ValueError:
            pass
        else:
            print item[:end]

For more complex operations, of course, BeautifulSoup is still preferred.
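
For example, a minimal BeautifulSoup 3 sketch (using the same placeholder URL as above) that also picks up anchors where href is not the first attribute, e.g. <a class="x" href="...">, which the string-split approach above would skip:

import urllib2
from BeautifulSoup import BeautifulSoup

# let the parser find the anchors instead of relying on the exact attribute order
page = urllib2.urlopen("http://www.somewhere.com")
soup = BeautifulSoup(page.read())
for a in soup.findAll('a'):
    href = a.get('href')
    if href:
        print href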

ghostdog74
A: 

Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or somewhere that doesn't allow anything that isn't pure Python.

lxml.html also supports CSS3 selectors, so this sort of thing is trivial.
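
For instance, a minimal sketch (the URL is just a placeholder borrowed from an answer above) that grabs every href with a CSS selector:

import lxml.html

# parse straight from a URL, resolve relative links against the page's base URL,
# and select every anchor that has an href attribute
doc = lxml.html.parse('http://www.nytimes.com').getroot()
doc.make_links_absolute()
for a in doc.cssselect('a[href]'):
    print a.get('href')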

Wahnfrieden