ansaurus

Question

Retrieve links from a web page using Python and BeautifulSoup

Answer 1

+7 A:

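import urllib2
import BeautifulSoup

# Fetch the page; BeautifulSoup accepts the file-like response directly.
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a'):
    # a['href'] raises KeyError for anchors without an href, so use get().
    href = a.get('href', '')
    if 'national-park' in href:
        print 'found a url with national-park in the link'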

Andrew Johnson 2009-07-03 18:37:53

I think you should replace response with response.read()

Geo 2009-07-03 18:57:05

This code is correct. Paste it into an interpreter.

Andrew Johnson 2009-07-03 19:26:57

Sorry then :). I remember I was using it with response.read() every time.

Geo 2009-07-03 19:42:54

Answer 2

+4 A:

Here's a short snippet using the SoupStrainer class in BeautifulSoup:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        print link['href']

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:

http://www.crummy.com/software/BeautifulSoup/documentation.html

Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory and speed wise), if you know what you're parsing in advance.
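
For readers on Python 3, the same idea can be sketched with the newer `bs4` package, where the argument is named `parse_only` instead of `parseOnlyThese`. This is a minimal sketch that assumes `bs4` is installed and works on an HTML string rather than a live request; the helper name `extract_links` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer


def extract_links(html):
    """Return the href of every <a> tag, parsing only the anchors."""
    # SoupStrainer restricts tree building to <a> tags and their contents.
    only_anchors = SoupStrainer("a")
    soup = BeautifulSoup(html, "html.parser", parse_only=only_anchors)
    # href=True skips anchors that have no href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


# Hypothetical sample document, not taken from any real site.
sample = '<p><a href="/one">1</a> <a name="x">no href</a> <a href="/two">2</a></p>'
links = extract_links(sample)
```

Passing `href=True` plays the role of the `has_key('href')` check in the snippet above.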

ars 2009-07-03 18:53:55

+1, using the soup strainer is a great idea because it allows you to circumvent a lot of unnecessary parsing when all you're after are the links.

Evan Fosmark 2009-07-03 18:57:34

I edited to add a similar explanation before I saw Evan's comment. Thanks for noting that, though!

ars 2009-07-03 19:01:16

Thanks, this solved my problem. With this I finished my project. Thanks a lot!

NepUS 2009-07-03 21:17:57

Answer 3

+1 A:

Just for getting the links, without BeautifulSoup or regex:

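import urllib2

url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
# Split on closing anchor tags so each chunk holds at most one link.
data = page.read().split("</a>")
tag = "<a href=\""
endtag = "\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind + len(tag):]
            # Only matches when href is the last attribute of the tag.
            end = item.index(endtag)
        except ValueError:
            # str.index raises ValueError when a marker is missing; skip it.
            pass
        else:
            print item[:end]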

For more complex operations, of course, BeautifulSoup is still preferred.
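
If the goal is to avoid third-party packages but still cope with attributes in any order, the standard library's HTML parser is one alternative to string splitting. The following is a minimal sketch in Python 3 that assumes the page is already available as a string; the class name `LinkCollector` and the sample markup are illustrative only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical sample markup; note that href is not the last attribute.
sample = '<a href="http://example.com/a" class="x">A</a><a name="n">B</a>'
collector = LinkCollector()
collector.feed(sample)
collected = collector.links
```

Unlike the split-based loop, this does not depend on `href` being the final attribute before the closing `>`.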

ghostdog74 2009-07-04 03:11:21

Answer 4

A:

Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it also parses and scrapes HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (BeautifulSoup's claim to fame). It also has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

lxml.html also supports CSS3 selectors so this sort of thing is trivial.
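
As a rough illustration of how short link extraction can be with lxml, here is a minimal Python 3 sketch. It assumes `lxml` is installed and uses an XPath query, since the CSS selector route (`doc.cssselect(...)`) additionally needs the separate `cssselect` package in current lxml releases. The sample markup is invented for the example.

```python
import lxml.html

# Hypothetical sample markup, not taken from any real page.
sample = '<div><a href="/parks/yosemite">Y</a><a>none</a><a href="/about">A</a></div>'
doc = lxml.html.fromstring(sample)

# Select the href attribute of every <a> element that has one.
hrefs = [str(h) for h in doc.xpath("//a/@href")]

# Keep only the links that contain a given substring.
park_links = [h for h in hrefs if "parks" in h]
```

With `cssselect` available, `doc.cssselect("a[href]")` would return the matching elements instead of the attribute values.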

Wahnfrieden 2009-08-03 15:34:01
