I know BeautifulSoup can edit text; is it possible to edit the href links as well? I would like to take, say, <a href="/foo/bar/"> and use BeautifulSoup to change it to <a href="http://www.foobarinc.com/foo/bar/">. I am not sure how to do this with BeautifulSoup. Any help is much appreciated.
As in your other question: with BeautifulSoup you're parsing the content into a set of hierarchically nested objects representing the document, then changing those objects before serialising them back to different text. You're not editing the text directly.
The href="..." part of the markup represents an attribute. To access the attributes of an element in BeautifulSoup you use el[name] item-style access. So to change rooted URLs into absolute ones in the href attributes of <a> elements, it's as simple as:
for link in soup.findAll('a'):
    if link['href'].startswith('/'):
        link['href'] = 'http://www.foobarinc.com' + link['href']
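For anyone on Python 3, here is a rough equivalent of the loop above. It is only a sketch under a few assumptions: that the bs4 package (BeautifulSoup 4) is installed, that the built-in "html.parser" is good enough for your markup, and that urljoin from the standard library is an acceptable substitute for plain string concatenation. The helper name absolutize_links and the BASE_URL constant are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 (bs4) is installed

BASE_URL = "http://www.foobarinc.com"  # site root taken from the question


def absolutize_links(html, base=BASE_URL):
    """Return html with every rooted <a href> rewritten against base (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags that have no href attribute at all,
    # so link["href"] below cannot raise KeyError.
    for link in soup.find_all("a", href=True):
        if link["href"].startswith("/"):
            link["href"] = urljoin(base, link["href"])
    return str(soup)


print(absolutize_links('<a href="/foo/bar/">Hello world</a><a name="top">Top</a>'))
```

The href=True filter is the main difference from the original loop: anchors without an href are simply left alone. Using urljoin also means a protocol-relative link such as //cdn.example.com/x is resolved to its own host rather than being glued onto the base.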
Despite what the OP says in a comment to bobince, the following code works just fine:
from BeautifulSoup import BeautifulSoup

ht = '''
<a href="/foo/bar/">Hello world</a>
'''

soup = BeautifulSoup(ht)
for link in soup.findAll('a'):
    if link['href'].startswith('/'):
        link['href'] = 'http://www.foobarinc.com' + link['href']
print soup
emits
<a href="http://www.foobarinc.com/foo/bar/">Hello world</a>
as desired. So, instead of vaguely claiming
I was trying similar things to that, but kept getting [none] returned. I cannot find a list of the dictionary that matches the tags. I tried this, and it returned KeyError:href.
(???), the OP had better try to modify the code I just posted, getting it closer and closer to his own, until the weird errors "[none] returned" and "KeyError:href" (???) appear. At that point, the very last change that made them appear should make it blatantly obvious what the OP is doing wrong. If not, post the exact data and code, as I did, and the exact copy-and-pasted traceback (not vague personal paraphrases!-), and I bet we'll be able to help!-)
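Since the OP's actual code and data were never posted, the following is only a guess at one way both symptoms can show up: an <a> tag in the document that has no href attribute (a named anchor, for instance). The sketch assumes Python 3 with the bs4 package installed, and the sample markup is invented for the purpose.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 (bs4) is installed

# Invented sample: the first <a> is a named anchor with no href.
html = '<a name="top">No href here</a><a href="/foo/bar/">Hello world</a>'
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")

# .get() returns None for a missing attribute instead of raising.
missing = first.get("href")

# Item-style access raises KeyError when the attribute is absent.
try:
    first["href"]
    raised = False
except KeyError:
    raised = True

print(missing, raised, second["href"])
```

If that is what is happening, filtering with find_all("a", href=True) or checking link.get("href") before touching it would avoid both problems. Whether this matches the OP's situation can only be confirmed from the real traceback.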