I know BeautifulSoup can edit text, but is it possible to edit href links as well? I would like to take, say, <a href="/foo/bar/"> and use BeautifulSoup to change it to <a href="http://www.foobarinc.com/foo/bar/">. I am not sure how to do this with BeautifulSoup. Any help is much appreciated.

+6  A: 

As in your other question: with BeautifulSoup you parse the content into a set of hierarchically nested objects representing the document, then change those objects before serialising them back out as different text. You're not editing the text directly.

The href="..." part of the markup represents an attribute. To access the attributes of an element in BeautifulSoup you use item-style access, el[name]. So to change rooted URLs into absolute ones in the href attributes of a elements, it's as simple as:

for link in soup.findAll('a'):
    if link['href'].startswith('/'):
        link['href'] = 'http://www.foobarinc.com' + link['href']
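
String concatenation is enough for rooted paths like /foo/bar/. If the page also contains relative or already-absolute links, the standard library's urljoin may be a safer way to build the new value. A minimal sketch, assuming Python 3; BASE and absolutize are illustrative names, not part of BeautifulSoup:

```python
# Standard library only (Python 3). BASE and absolutize are hypothetical names.
from urllib.parse import urljoin

BASE = 'http://www.foobarinc.com'

def absolutize(href, base=BASE):
    """Resolve href against base; already-absolute URLs pass through unchanged."""
    return urljoin(base, href)

print(absolutize('/foo/bar/'))             # rooted path, resolved against BASE
print(absolutize('http://example.org/x'))  # already absolute, left as-is
```

Inside the loop you would then assign link['href'] = absolutize(link['href']) instead of concatenating strings.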
bobince
I was trying similar things to that, but kept getting [none] returned. I cannot find a list of the dictionary that matches the tags. I tried this, and it returned KeyError:href. Also, thanks a lot bobince.
Kevin
+2  A: 

Despite what the OP says in a comment to bobince, the following code works just fine:

from BeautifulSoup import BeautifulSoup

ht = '''
  <a href="/foo/bar/">Hello world</a>
'''
soup = BeautifulSoup(ht)

for link in soup.findAll('a'):
    if link['href'].startswith('/'):
        link['href'] = 'http://www.foobarinc.com' + link['href']
print soup

emits

<a href="http://www.foobarinc.com/foo/bar/">Hello world</a>

as desired. So, instead of vaguely claiming

I was trying similar things to that, but kept getting [none] returned. I cannot find a list of the dictionary that matches the tags. I tried this, and it returned KeyError:href.

(???), the OP would do better to modify the code I just posted step by step, moving it closer and closer to his own, until the weird symptoms ([none] returned, KeyError:href) appear. At that point, the very last change that made them appear should make it blatantly obvious what the OP is doing wrong. If not, post the exact data and code, as I did, plus the traceback copied and pasted exactly (not a vague personal paraphrase!-), and I bet we'll be able to help!-)
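
For what it's worth, one plausible cause of KeyError:href is an a tag that has no href attribute at all, such as a named anchor. This is only a guess, since the OP's data was never posted. A minimal sketch that reproduces the error and guards against it, assuming Python 3 and the bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 import used above; rewrite_links and the sample markup are made up for illustration:

```python
# Assumes the third-party bs4 package (BeautifulSoup 4) is installed.
from bs4 import BeautifulSoup

# Hypothetical input: the first anchor has no href attribute.
html = '<a name="top">anchor only</a><a href="/foo/bar/">Hello world</a>'
soup = BeautifulSoup(html, 'html.parser')

# Item access on a missing attribute raises KeyError.
try:
    soup.find('a')['href']
except KeyError as exc:
    print('KeyError:', exc)

def rewrite_links(soup, base):
    """Prefix rooted hrefs with base, skipping tags that have no href."""
    for link in soup.find_all('a'):
        href = link.get('href')  # None when the attribute is absent
        if href is not None and href.startswith('/'):
            link['href'] = base + href
    return soup

rewrite_links(soup, 'http://www.foobarinc.com')
print(soup)
```

Using link.get('href') returns None for the attribute-less anchor instead of raising, so that tag is simply left untouched.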

Alex Martelli