ansaurus

Question

Using Beautiful Soup Python module to replace tags with plain text

Answer 1

+2 A:

An approach that works for your specific example is:

from BeautifulSoup import BeautifulSoup

ht = '''
<div id="abc">
    some long text goes <a href="/"> here </a> and hopefully it 
    will get picked up by the parser as content
</div>
'''
soup = BeautifulSoup(ht)

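# Fold each anchor's text into the string that precedes it.
anchors = soup.findAll('a')
for a in anchors:
    a.previousSibling.replaceWith(a.previousSibling + a.string)

# Keep only the text nodes longer than 20 characters.
results = soup.findAll(text=lambda x: len(x) > 20)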

print results

which emits

$ python bs.py
[u'\n    some long text goes  here ', u' and hopefully it \n    will get picked up by the parser as content\n']

Of course, you'll probably need to take a bit more care: what if there's no a.string, or a.previousSibling is None? You'll need suitable if statements to handle such corner cases. But I hope this general idea can help you. (In fact, you may also want to merge the next sibling if it's a string. I'm not sure how that plays with your len(x) > 20 heuristic, but say, for example, that you have two 9-character strings with an <a> containing a 5-character string in the middle: perhaps you'd want to pick up the lot as a "23-character string"? I can't tell, because I don't understand the motivation for your heuristic.)
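The corner cases mentioned above could be guarded roughly as follows. This is only a sketch under assumptions: it uses the newer bs4 package on Python 3 (so find_all, previous_sibling and replace_with rather than the BeautifulSoup 3 spellings in the answer), and merge_with_siblings is a hypothetical helper name, not part of the library.

```python
from bs4 import BeautifulSoup, NavigableString


def merge_with_siblings(soup, name="a"):
    """Replace each <name> tag with its text, joined to adjacent string siblings."""
    for tag in soup.find_all(name):
        parts = [tag.get_text()]  # "" when the tag holds no text at all
        prev, nxt = tag.previous_sibling, tag.next_sibling
        # Only strings can be concatenated; a Tag or None sibling is left alone.
        if isinstance(prev, NavigableString):
            parts.insert(0, str(prev.extract()))
        if isinstance(nxt, NavigableString):
            parts.append(str(nxt.extract()))
        tag.replace_with(NavigableString("".join(parts)))
    return soup


html = '<div id="abc">some long text goes <a href="/"> here </a> and hopefully it gets picked up</div>'
soup = merge_with_siblings(BeautifulSoup(html, "html.parser"))
# Same length heuristic as in the answer, applied after merging.
long_strings = [str(s) for s in soup.find_all(string=lambda s: len(s) > 20)]
```

Because the next sibling is merged as well, the three fragments around the anchor end up as a single text node, so the length heuristic sees the whole sentence rather than two halves.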

I imagine that besides <a> tags you'll also want to remove other tags as well? I guess this, too, depends on what the actual idea behind your heuristic is!

Alex Martelli 2010-01-14 02:49:02

Thanks very much, Alex. Your code works very well for many combinations of the sample that I had posted. However, when I run it on a real website I get strange results. I am not sure what I am doing wrong! I just updated the post with my new code. Your help is greatly appreciated. You are correct, I wanted to merge all the text into one giant string. I am basically trying to get the 'content' portion of a page so I can show it as a summary. You are also correct that I will eventually have to handle all the other tags as well.

Ecognium 2010-01-14 05:25:56

@Ecognium, the specific problem you're encountering is when the previous or next sibling does exist but is immediately a tag, not a string -- in that case you cannot concatenate it with a string (so you should basically skip in this case, i.e., perform no alteration!). For handling multiple tags, make sure you iterate over them in order (use a selector function that returns True for all the tags you want to remove, and those only).
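One way to picture the "selector function" idea is sketched below. It is an assumption-laden illustration, not the code from this thread: it relies on bs4 on Python 3, on Tag.unwrap() to replace a tag with its children, and on smooth() (available in bs4 4.8 and later) to join adjacent strings. INLINE_TAGS, is_removable and flatten are hypothetical names chosen for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical set of tags to strip; adjust to whatever the heuristic needs.
INLINE_TAGS = {"a", "b", "i", "span"}


def is_removable(tag):
    """Selector: True only for the tags that should be replaced by their content."""
    return tag.name in INLINE_TAGS


def flatten(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns matches in document order, so outer tags are handled first.
    for tag in soup.find_all(is_removable):
        tag.unwrap()  # keeps the children, drops the tag itself
    soup.smooth()  # merge the now-adjacent strings into single text nodes
    return soup


flat = flatten('<p>one <b>two <i>three</i></b> <a href="/">four</a> five</p>')
print(flat)
```

Since unwrap() never concatenates anything itself, the case where a sibling is a tag rather than a string needs no special handling here.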

Alex Martelli 2010-01-14 05:43:34

@Alex, thanks again. That makes sense. I added some instance checks to skip the case where the previous sibling is a Tag, but even that causes problems. I will debug more and try to figure out the issue. Thanks very much for your time.

Ecognium 2010-01-14 06:19:04

Answer 2

A:

I wanted to flatten tags in a document, so that each tag's entire content is pulled up into its parent node in place. (Specifically, I wanted to reduce a p tag to its content, keeping the sub-paragraphs, lists, div and span, etc. inside, but get rid of the style and font tags and some horrible Word-to-HTML generator remnants.) I found this rather complicated to do with BeautifulSoup itself, since extract() also removes the content and replaceWith() unfortunately doesn't accept None as an argument. After some wild recursion experiments, I finally decided to use regular expressions, either before or after processing the document with BeautifulSoup, with the following method:

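import re

def flatten_tags(s, tags):
    # tags is either a single tag name or a list of tag names
    names = tags if isinstance(tags, basestring) else "|".join(tags)
    # match opening, closing and self-closing forms; \b stops "b" from also matching "<body>"
    pattern = re.compile(r'<\s*/?\s*(?:%s)\b(?:[^<>"]|"[^"]*")*>' % names)
    return pattern.sub("", s)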

The tags argument is either a single tag or a list of tags to be flattened.
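A short usage sketch may make the effect clearer. To run on its own it restates the idea for Python 3 (str instead of basestring) under the hypothetical name flatten_tags_py3, and the pattern is one possible variant rather than a definitive one. It assumes the input is reasonably well-formed markup with double-quoted attribute values.

```python
import re


def flatten_tags_py3(markup, tags):
    """Remove the named tags from markup but keep whatever they enclose."""
    names = [tags] if isinstance(tags, str) else list(tags)
    alternation = "|".join(re.escape(n) for n in names)
    # Opening, closing and self-closing forms; \b keeps "b" from matching "<body>".
    pattern = re.compile(
        r'<\s*/?\s*(?:%s)\b(?:[^<>"]|"[^"]*")*>' % alternation,
        re.IGNORECASE,
    )
    return pattern.sub("", markup)


sample = '<p><font face="Arial">Hello <b>world</b></font><br/></p>'
only_font = flatten_tags_py3(sample, "font")
font_and_br = flatten_tags_py3(sample, ["font", "br"])
print(only_font)    # <p>Hello <b>world</b><br/></p>
print(font_and_br)  # <p>Hello <b>world</b></p>
```

Note that only the tags themselves are deleted; the text and any nested tags they contained stay where they were, which is the "flattening" described in the answer.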

aldi 2010-07-02 17:05:58
