Hi All,
I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that's how I got started with it.
I was able to successfully get most of the content but I am running into some challenges with tags that are part of the content. (I am starting off with a basic strategy of: if there are more than x-chars in a node then it is content). Let's take the html code below as an example:
<div id="abc">
some long text goes <a href="/"> here </a> and hopefully it
will get picked up by the parser as content
</div>
results = soup.findAll(text=lambda(x): len(x) > 20)
When I use the above code to get at the long text, it breaks (the identified text will start from 'and hopefully..') at the tags. So I tried to replace the tag with plain text as follows:
anchors = soup.findAll('a')
for a in anchors:
a.replaceWith('plain text')
The above does not work because Beautiful Soup inserts the string as a NavigableString and that causes the same problem when I use findAll with the len(x) > 20. I can use regular expressions to parse the html as plain text first, clear out all the unwanted tags and then call Beautiful Soup. But I would like to avoid processing the same content twice -- I am trying to parse these pages so I can show a snippet of content for a given link (very much like Facebook Share) -- and if everything is done with Beautiful Soup, I presume it will be faster.
So my question: is there a way to 'clear tags' and replace them with 'plain text' using Beautiful Soup. If not, what will be best way to do so?
Thanks for your suggestions!
Update: Alex's code worked very well for the sample example. I also tried various edge cases and they all worked fine (with the modification below). So I gave it a shot on a real life website and I run into issues that puzzle me.
import urllib
from BeautifulSoup import BeautifulSoup
page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')
anchors = soup.findAll('a')
i = 0
for a in anchors:
print str(i) + ":" + str(a)
for a in anchors:
if (a.string is None): a.string = ''
if (a.previousSibling is None and a.nextSibling is None):
a.previousSibling = a.string
elif (a.previousSibling is None and a.nextSibling is not None):
a.nextSibling.replaceWith(a.string + a.nextSibling)
elif (a.previousSibling is not None and a.nextSibling is None):
a.previousSibling.replaceWith(a.previousSibling + a.string)
else:
a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
a.nextSibling.extract()
i = i+1
When I run the above code, I get the following error:
0:<a href="http://www.switched.com/category/ces-2010">Stay up to date with
Switched's CES 2010 coverage</a>
Traceback (most recent call last):
File "parselink.py", line 44, in <module>
a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'
When I look at the HTML code, 'Stay up to date.." does not have any previous sibling (I did not how previous sibling worked until I saw Alex's code and based on my testing it looks like it is looking for 'text' before the tag). So, if there is no previous sibling, I am surprised that it is not going through the if logic of a.previousSibling is None and a;nextSibling is None.
Could you please let me know what I am doing wrong?
-ecognium