I want to use BeautifulSoup to search for </a> and replace it with </a><br>. I know how to open a page with urllib2 and parse it to extract all the <a> tags. What I want to do is replace each closing tag with the closing tag plus a line break. Any help is much appreciated.

EDIT

I would assume it would be something similar to:

soup.findAll('a').

In the documentation, there is a:

find(text="ahh").replaceWith('Hooray')

So I would assume it would be along the lines of:

soup.findAll(tag = '</a>').replaceWith(tag = '</a><br>')

But that doesn't work, and Python's help() doesn't say much.

+2  A: 

This will insert a <br> tag after the end of each <a>...</a> element:

from BeautifulSoup import BeautifulSoup, Tag

# ....

soup = BeautifulSoup(data)
for a in soup.findAll('a'):
    a.parent.insert(a.parent.index(a)+1, Tag(soup, 'br'))

You can't use soup.findAll(tag = '</a>') because BeautifulSoup doesn't operate on the end tags separately - they are considered part of the same element.
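For readers on the newer bs4 package (BeautifulSoup 4) rather than the BeautifulSoup 3 import used above, the same idea can be sketched with insert_after. This is only an illustration under that assumption; the sample HTML string is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html = '<p><a href="one">first</a> and <a href="two">second</a></p>'
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a"):
    # new_tag() builds a detached <br>; insert_after() makes it
    # the next sibling of the whole <a>...</a> element
    a.insert_after(soup.new_tag("br"))

result = str(soup)
print(result)
```

Because the <br> is attached as a sibling of the element, there is no need to compute an index by hand.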


If you wanted to put the <a> elements inside a <p> element as you ask in a comment, you can use this:

for a in soup.findAll('a'):
    p = Tag(soup, 'p')  # create a <p> element
    a.replaceWith(p)    # put it where the <a> element is
    p.insert(0, a)      # put the <a> element inside the <p> (between <p> and </p>)

Again, you don't create the <p> and </p> separately because they are part of the same thing.
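With the newer bs4 package (an assumption; the code above targets BeautifulSoup 3), the replace-then-insert steps can be sketched as a single wrap() call. The sample markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
soup = BeautifulSoup('<div><a href="foo">link</a> text</div>', "html.parser")

for a in soup.find_all("a"):
    # wrap() puts the new <p> where the <a> was and moves the <a> inside it
    a.wrap(soup.new_tag("p"))

wrapped = str(soup)
print(wrapped)
```

As with the loop above, the <p> is created once as an element; its start and end tags only appear when the tree is serialised.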

interjay
Will that add it to every opening <a> tag as well?
Kevin
See my edit - It will be added after the whole <a>...</a> element, so effectively, it will be only after the </a>.
interjay
Is BeautifulSoup.Tag valid? I am getting an error when trying this code.
Kevin
It depends on how you import the module. I edited to show one way it can work - try it now.
interjay
+1  A: 

You don't replace an end-tag; in BeautifulSoup you are dealing with a document object model like in a browser, not a string full of HTML. So you couldn't ‘replace’ an end-tag without also replacing the start-tag.
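A small sketch of that point, assuming the newer bs4 package rather than the BeautifulSoup 3 import used in this thread: searching for an end tag finds nothing, while a search for the element name returns one object that covers both tags.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body>blah <a href="foo">blah</a> blah</body>', "html.parser")

end_tags = soup.find_all("</a>")  # no element is named "</a>", so nothing matches
links = soup.find_all("a")        # one Tag object spans <a ...> through </a>

print(len(end_tags), len(links))
print(links[0])
```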

What you want to do is insert a new <br> element immediately after the <a>...</a> element. To do so, you'll need to find the index of the <a> element inside its parent element and insert the new element just after that index, e.g.:

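from BeautifulSoup import BeautifulSoup, Tag

soup = BeautifulSoup('<body>blah <a href="foo">blah</a> blah</body>')
for link in soup.findAll('a'):
    br = Tag(soup, 'br')                      # new, empty <br> element
    index = link.parent.contents.index(link)  # position of the <a> in its parent
    link.parent.insert(index + 1, br)         # insert directly after the <a>
# soup now serialises to '<body>blah <a href="foo">blah</a><br /> blah</body>'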
bobince
Would I be able to add tags before with a -1? Say I wanted to add <p> and </p>. Could I put the <p> before using index -1 and the </p> after using +1?
Kevin
You'd add an element *before* the chosen element using just `index`, not plus or minus anything.
bobince
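
The "just index" advice can be sketched as follows, assuming the newer bs4 package rather than the BeautifulSoup 3 import used in the answers. Inserting at the element's own index pushes the element one slot to the right, so the new node ends up immediately before it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body>blah <a href="foo">blah</a> blah</body>', "html.parser")

for link in soup.find_all("a"):
    parent = link.parent
    index = parent.index(link)                # position of <a> among the parent's children
    parent.insert(index, soup.new_tag("br"))  # same index: <br> lands just before <a>

print(soup)
```

Note that this adds a separate sibling element; it does not enclose the link. Putting the <a> inside a <p> still means creating one <p> element and moving the <a> into it, as in the first answer.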