Starting from HTML input like this:

<p>
<a href="http://www.foo.com">this if foo</a>
<a href="http://www.bar.com">this if bar</a>
</p>

Using BeautifulSoup, I would like to change this HTML into:

<p>
<a href="http://www.foo.com">this if foo[1]</a>
<a href="http://www.bar.com">this if bar[2]</a>
</p>

while saving the parsed links in a dictionary, with a result like this:

links_dict = {"1":"http://www.foo.com","2":"http://www.bar.com"}

Is it possible to do this using BeautifulSoup? Any valid alternative?

+1  A: 

This should be easy in Beautiful Soup.

Something like:

from BeautifulSoup import BeautifulSoup, Tag

# `text` is assumed to hold the HTML string from the question
count = 1
links_dict = {}
soup = BeautifulSoup(text)
for link_tag in soup.findAll('a'):
    # .get() returns None instead of raising KeyError when href is missing
    href = link_tag.get('href')
    if href:
        links_dict[count] = href
        # Rebuild the tag with all of its original attributes,
        # then insert the old text followed by the [n] marker
        newTag = Tag(soup, "a", link_tag.attrs)
        newTag.insert(0, ''.join(link_tag.contents) + "[%d]" % count)
        link_tag.replaceWith(newTag)
        count += 1

Result of executing this on your text:

>>> soup
<p>
  <a href="http://www.foo.com">this if foo[1]</a>
  <a href="http://www.bar.com">this if bar[2]</a>
</p>

>>> links_dict
{1: u'http://www.foo.com', 2: u'http://www.bar.com'}

The only problem I can foresee with this solution is link text that contains subtags: link_tag.contents would then hold Tag objects as well as strings, so ''.join(link_tag.contents) would fail. In that case you would need to navigate to the rightmost text element instead.
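As a rough sketch of one way around the subtag problem: instead of joining the contents, the marker can be appended as a new text node at the end of the anchor. Note that this places the marker after the last child rather than inside the rightmost text element. The example assumes Beautiful Soup 4 (the bs4 package) rather than the BeautifulSoup 3 import used above, and the sample HTML with a nested <b> tag is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not BS3

# Hypothetical input: the first link contains a nested tag
html = (
    '<p>'
    '<a href="http://www.foo.com">this if <b>foo</b></a>'
    '<a href="http://www.bar.com" rel="nofollow">this if bar</a>'
    '</p>'
)

soup = BeautifulSoup(html, "html.parser")
links_dict = {}

# href=True skips anchors that have no href attribute at all
for count, link_tag in enumerate(soup.find_all("a", href=True), start=1):
    links_dict[count] = link_tag["href"]
    # append() adds a text node after the last child,
    # so nested tags and other attributes are left untouched
    link_tag.append("[%d]" % count)

result = str(soup)
print(result)
print(links_dict)
```

Because the existing tag is modified in place, attributes such as rel="nofollow" survive without being copied by hand.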

danben
@danben +1 for the effort. Actually, this is like the code I wrote before asking the question. It does not work because you end up with something like <a href="http://www.foo.com[1]">this if foo</a>, which is not what I want.
systempuntoout
@systempuntoout: edited; the current code is working for me.
danben
@danben Do you think it is possible to change the node's content without creating a new tag?
systempuntoout
I was not able to do that, and the documentation suggests that it is not possible. Why is creating a new Tag undesirable?
danben
@Danben Because I could have other attributes besides href, a rel="nofollow" for example. Please have a look at this other question: http://stackoverflow.com/questions/2904542/is-it-possibile-to-modify-a-link-value-with-beautifulsoup-without-recreating-the
systempuntoout
@Danben OK, I found it; I replaced newTag = Tag(soup, "a", [("href", link_tag['href'])]) with newTag = Tag(soup, "a", link_tag.attrs). Thanks! Please update your code.
systempuntoout
Ah, right - I was being lazy.
danben
For the record, with 3.08 you can avoid the Tag class and update link_tag.string directly. See this: http://stackoverflow.com/questions/2904542/is-it-possibile-to-modify-a-link-value-with-beautifulsoup-without-recreating-the
systempuntoout
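A minimal sketch of what the last comment describes, updating the link's string in place rather than building a new Tag. It is written against Beautiful Soup 4 (the bs4 package), which is an assumption: the comment refers to the older 3.x series, whose method names differ (for example replaceWith instead of replace_with). The sample HTML and the hard-coded index 1 are for illustration only.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the 3.x release mentioned above

html = '<p><a href="http://www.foo.com" rel="nofollow">this if foo</a></p>'
soup = BeautifulSoup(html, "html.parser")

link_tag = soup.find("a")

# .string is None when the tag has more than one child,
# so only rewrite the text when there is a single string to replace
if link_tag.string is not None:
    link_tag.string.replace_with("%s[%d]" % (link_tag.string, 1))

updated = str(link_tag)
print(updated)
```

Only the text node is swapped, so href, rel, and any other attributes stay as they were.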