ansaurus

Question

Answer 1

+4 A:

I don't know if BeautifulSoup can do it more elegantly, but you could merge the two loops like so:

for tag in soup.findAll(['script', 'form']) + soup.findAll(id="footer"):
    tag.extract()

You can find classes like so (Documentation):

for tag in soup.findAll(attrs={'class': 'noprint'}):
    tag.extract()
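
For anyone on the newer bs4 package, the two loops above might look roughly like this. It is only a sketch, assuming bs4 is installed (there findAll is spelled find_all and the class lookup has a class_ keyword); the sample HTML is invented for illustration.

```python
from bs4 import BeautifulSoup  # assumption: bs4, not the BeautifulSoup 3 module

# Hypothetical sample page, invented for this example.
html = """
<html><body>
<script>var x = 1;</script>
<form action="/search"><input name="q"></form>
<p class="noprint">Print this page</p>
<p id="content">Article text</p>
<div id="footer">Copyright</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# One loop for the tag names and the footer id: find_all returns a
# list-like result, so two results can be concatenated with +.
for tag in soup.find_all(["script", "form"]) + soup.find_all(id="footer"):
    tag.extract()

# A second loop for everything carrying the CSS class "noprint".
for tag in soup.find_all(class_="noprint"):
    tag.extract()

# Names of the tags that are still in the tree.
remaining = [tag.name for tag in soup.find_all(True)]
print(remaining)
```

The input tag disappears together with its parent form, so only the html, body and content paragraph tags should be left.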

Yacoby 2009-12-01 10:05:26

It's working well, but combining long loops like ...+...+...+...+...+...+...+... doesn't look clean. Is there a better method?

Priyank Bolia 2009-12-01 10:33:30

Answer 2

A:

The answer to the second part of your question is right there in the documentation:

Searching by CSS class

The attrs argument would be a pretty obscure feature were it not for one thing: CSS. It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, class, is also a Python reserved word.

You could search by CSS class with soup.find("tagName", { "class" : "cssClass" }), but that's a lot of code for such a common operation. Instead, you can pass a string for attrs instead of a dictionary. The string will be used to restrict the CSS class.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("""Bob's <b>Bold</b> Barbeque Sauce now available in 
 <b class="hickory">Hickory</b> and <b class="lime">Lime</a>""")

soup.find("b", { "class" : "lime" })
# <b class="lime">Lime</b>

soup.find("b", "hickory")
# <b class="hickory">Hickory</b>
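
The same class lookups can be tried in the newer bs4 package. A minimal sketch, assuming bs4 is installed; the markup is made up, and class_ is the bs4 keyword spelling that sidesteps the reserved word.

```python
from bs4 import BeautifulSoup  # assumption: bs4 rather than BeautifulSoup 3

# Hypothetical markup, invented for this example.
soup = BeautifulSoup(
    '<p>Now in <b class="hickory">Hickory</b> and <b class="lime">Lime</b></p>',
    "html.parser",
)

by_dict = soup.find("b", {"class": "lime"})   # explicit attrs dictionary
by_string = soup.find("b", "hickory")         # a string for attrs means "CSS class"
by_keyword = soup.find("b", class_="lime")    # bs4 keyword form

print(by_dict.get_text(), by_string.get_text(), by_keyword.get_text())
```

All three calls return the matching tag itself, so the result can be passed straight to extract().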

hop 2009-12-01 10:09:45

uh... pardon me? why the downvote?

hop 2009-12-01 10:42:11

Answer 3

+4 A:

You can pass functions to .findAll() like this:

soup.findAll(lambda tag: tag.name in ['script', 'form'] or tag.get('id') == 'footer')

But you might be better off by first building a list of tags and then iterating over it:

tags = soup.findAll(['script', 'form'])
tags.extend(soup.findAll(id="footer"))

for tag in tags:
    tag.extract()
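
If the newer bs4 package is an option, a different technique from the one above can collect everything in one call: a CSS selector group passed to select(). This is a sketch under that assumption, not something the original BeautifulSoup 3 module offers in this form; the sample HTML is invented.

```python
from bs4 import BeautifulSoup  # assumption: bs4 with CSS selector support

# Hypothetical sample markup, invented for this example.
html = """
<div id="content"><p>Body</p><script>track()</script></div>
<form><input name="q"></form>
<div id="footer">Footer</div>
<span class="noprint">Print</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A selector group matches any element that satisfies at least one of
# its comma-separated parts: tag names, #id, or .class.
for tag in soup.select("script, form, #footer, .noprint"):
    tag.extract()

# Names of the tags that are still in the tree.
remaining = [tag.name for tag in soup.find_all(True)]
print(remaining)
```

Adding another tag name, id or class then means extending the selector string rather than chaining another findAll call.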

If you want to filter for several ids, you can use:

for tag in soup.findAll(lambda tag: tag.has_key('id') and
                                    tag['id'] in ['footer', 'content', 'links']):
    tag.extract()

A more specific approach would be to pass a lambda as the id argument:

for tag in soup.findAll(id=lambda value: value in ['footer', 'content', 'links']):
    tag.extract()
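
Here is a runnable sketch of the function-based filter, written for the newer bs4 package (an assumption; the answer itself uses BeautifulSoup 3). The helper name is_unwanted and the sample HTML are made up for illustration. It uses tag.get, which returns None for a missing attribute, so tags without an id do not raise KeyError.

```python
from bs4 import BeautifulSoup  # assumption: bs4 rather than BeautifulSoup 3

UNWANTED_IDS = {"footer", "content", "links"}  # ids taken from the answer above


def is_unwanted(tag):
    """Return True for script/form tags and for tags whose id is unwanted."""
    # tag.get returns None when the attribute is missing, so no KeyError.
    return tag.name in ("script", "form") or tag.get("id") in UNWANTED_IDS


# Hypothetical sample markup, invented for this example.
html = """
<div id="header">Title</div>
<script>track()</script>
<div id="links"><a href="/">Home</a></div>
<p>No id here</p>
<div id="footer">Footer</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all calls the function once per tag and keeps the tags it accepts.
for tag in soup.find_all(is_unwanted):
    tag.extract()

# Names of the tags that are still in the tree.
remaining = [tag.name for tag in soup.find_all(True)]
print(remaining)
```

Keeping the ids in one set means a new id only has to be added in one place.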

hop 2009-12-01 10:41:37

I am getting errors: SyntaxError: invalid syntax

Priyank Bolia 2009-12-01 10:47:08

SyntaxError? Strange... you should get a TypeError.

hop 2009-12-01 10:51:27

that fixed the TypeError

hop 2009-12-01 11:03:08

for tag in soup.findAll(lambda tag: tag['id'] in ['footer', 'content', 'links']) does not work: KeyError: 'id'. I am using Py2.6, if that helps.

Priyank Bolia 2009-12-01 11:37:10

This works though: for tag in soup.findAll(id=lambda(value): value and value in ['catlinks', 'siteSub', 'contentSub'])

Priyank Bolia 2009-12-01 11:44:41

i pasted the wrong version, sorry. this correction guards against tags that don't have the id attribute.

hop 2009-12-01 11:56:39

your version works too, of course, but i wanted to show a more general approach

hop 2009-12-01 11:58:32

Need help with Python/BeautifulSoup
