views:

348

answers:

2

Hi All,

I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that's how I got started with it.

I was able to successfully get most of the content but I am running into some challenges with tags that are part of the content. (I am starting off with a basic strategy of: if there are more than x-chars in a node then it is content). Let's take the html code below as an example:

<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>

results = soup.findAll(text=lambda(x): len(x) > 20)

When I use the above code to get at the long text, it breaks (the identified text will start from 'and hopefully..') at the tags. So I tried to replace the tag with plain text as follows:

anchors = soup.findAll('a')

for a in anchors:
  a.replaceWith('plain text')

The above does not work because Beautiful Soup inserts the string as a NavigableString and that causes the same problem when I use findAll with the len(x) > 20. I can use regular expressions to parse the html as plain text first, clear out all the unwanted tags and then call Beautiful Soup. But I would like to avoid processing the same content twice -- I am trying to parse these pages so I can show a snippet of content for a given link (very much like Facebook Share) -- and if everything is done with Beautiful Soup, I presume it will be faster.

So my question: is there a way to 'clear tags' and replace them with 'plain text' using Beautiful Soup. If not, what will be best way to do so?

Thanks for your suggestions!

Update: Alex's code worked very well for the sample example. I also tried various edge cases and they all worked fine (with the modification below). So I gave it a shot on a real life website and I run into issues that puzzle me.

import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen('http://www.engadget.com/2010/01/12/kingston-ssdnow-v-dips-to-30gb-size-lower-price/')

anchors = soup.findAll('a')
i = 0
for a in anchors:
    print str(i) + ":" + str(a)
    for a in anchors:
        if (a.string is None): a.string = ''
        if (a.previousSibling is None and a.nextSibling is None):
            a.previousSibling = a.string
        elif (a.previousSibling is None and a.nextSibling is not None):
            a.nextSibling.replaceWith(a.string + a.nextSibling)
        elif (a.previousSibling is not None and a.nextSibling is None):
            a.previousSibling.replaceWith(a.previousSibling + a.string)
        else:
            a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
            a.nextSibling.extract()
    i = i+1

When I run the above code, I get the following error:

0:<a href="http://www.switched.com/category/ces-2010"&gt;Stay up to date with 
Switched's CES 2010 coverage</a>
Traceback (most recent call last):
  File "parselink.py", line 44, in <module>
  a.previousSibling.replaceWith(a.previousSibling + a.string + a.nextSibling)
 TypeError: unsupported operand type(s) for +: 'Tag' and 'NavigableString'

When I look at the HTML code, 'Stay up to date.." does not have any previous sibling (I did not how previous sibling worked until I saw Alex's code and based on my testing it looks like it is looking for 'text' before the tag). So, if there is no previous sibling, I am surprised that it is not going through the if logic of a.previousSibling is None and a;nextSibling is None.

Could you please let me know what I am doing wrong?

-ecognium

+2  A: 

An approach that works for your specific example is:

from BeautifulSoup import BeautifulSoup

ht = '''
<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>
'''
soup = BeautifulSoup(ht)

anchors = soup.findAll('a')
for a in anchors:
  a.previousSibling.replaceWith(a.previousSibling + a.string)

results = soup.findAll(text=lambda(x): len(x) > 20)

print results

which emits

$ python bs.py
[u'\n    some long text goes  here ', u' and hopefully it \n    will get picked up by the parser as content\n']

Of course, you'll probably need to take a bit more care, i.e., what if there's no a.string, or if a.previousSibling is None -- you'll need suitable if statements to take care of such corner cases. But I hope this general idea can help you. (In fact you may want to also merge the next sibling if it's a string -- not sure how that plays with your heuristics len(x) > 20, but say for example that you have two 9-character strings with an <a> containing a 5-character strings in the middle, perhaps you'd want to pick up the lot as a "23-characters string"? I can't tell because I don't understand the motivation for your heuristic).

I imagine that besides <a> tags you'll also want to remove others, such as <b> or <strong>, maybe <p> and/or <br>, etc...? I guess this, too, depends on what the actual idea behind your heuristics is!

Alex Martelli
Thanks very much, Alex. Your code works very well for many combinations of the sample that I had posted. However, when I run it on a real website I get strange results. I am not sure what I am doing wrong! I just update the post with my new code. Your help is greatly appreciated. You are correct, I wanted to merge all the text into one giant string. I am basically trying to get the 'content' portion of a page so I can show it a summary. You are also correct, I will have to eventually handle all other tags like <b> <strong> <em>, etc.
Ecognium
@Ecognium, the specific problem you're encountering is when the previous or next sibling does exist but is immediately a tag, not a string -- in that case you cannot concatenate it with a string (so you should basically skip in this case, i.e., perform no alteration!). For handling multiple tags, make sure you iterate over them in order (use a selector function that returns True for all the tags you want to remove, and those only).
Alex Martelli
@Alex, thanks again. That makes sense. I added some instance checks to ignore if the previous sibling is a Tag but even that causes problem. I will debug more and try to figure out the issue. Thanks very much for your time.
Ecognium
A: 

When I tried to flatten tags in the document, that way, the tags' entire content would be pulled up to its parent node in place (I wanted to reduce the content of a p tag with all sub-paragraphs, lists, div and span, etc. inside but get rid of the style and font tags and some horrible word-to-html generator remnants), I found it rather complicated to do with BeautifulSoup itself since extract() also removes the content and replaceWith() unfortunatetly doesn't accept None as argument. After some wild recursion experiments, I finally decided to use regular expressions either before or after processing the document with BeautifulSoup with the following method:

import re
def flatten_tags(s, tags):
   pattern = re.compile(r"<(( )*|/?)(%s)(([^<>]*=\\\".*\\\")*|[^<>]*)/?>"%(isinstance(tags, basestring) and tags or "|".join(tags)))
   return pattern.sub("", s)

The tags argument is either a single tag or a list of tags to be flattened.

aldi