views: 852
answers: 2

If a page has <div class="class1"> and <p class="class1">, then soup.findAll(True, 'class1') will find them both.

If it has <p class="class1 class2">, though, that element will not be found. How do I find all elements with a given class, regardless of whether they also have other classes?
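
To make the problem concrete, here is a minimal sketch (BeautifulSoup 3 API; the HTML snippet is illustrative):

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

html = '<div class="class1"></div><p class="class1"></p><p class="class1 class2"></p>'
soup = BeautifulSoup(html)

# A plain string matches only when the class attribute is exactly 'class1',
# so the multi-class <p> is missed:
print(len(soup.findAll(True, 'class1')))  # prints 2, not 3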

+4  A: 

Unfortunately, BeautifulSoup treats this as a single class whose name contains a space ('class1 class2') rather than as two classes (['class1', 'class2']). A workaround is to search for the class with a regular expression instead of a plain string.

This works:

import re
soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
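
For context, a minimal self-contained sketch of the workaround (BeautifulSoup 3 API; the HTML snippet is illustrative):

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

html = '<div class="class1"></div><p class="class1 class2"></p>'
soup = BeautifulSoup(html)

# \bclass1\b matches 'class1' as a whole word anywhere in the class attribute,
# so the multi-class <p> is found as well:
matches = soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
print([tag.name for tag in matches])  # ['div', 'p']

For what it's worth, the newer bs4 package treats class as a multi-valued attribute, so soup.find_all(True, class_='class1') matches all such elements without needing a regex.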
endolith
https://bugs.launchpad.net/bugs/410304
endolith
+3  A: 

You should use lxml. Its CSS selectors handle multiple space-separated class values ('class1 class2') correctly.

Despite its name, lxml is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (which is BeautifulSoup's claim to fame). It also has a BeautifulSoup-compatible API if you don't want to learn the lxml API.

Ian Bicking agrees and prefers lxml over BeautifulSoup. (He's a well-known Python developer, the creator of pip and virtualenv.)

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or somewhere else that doesn't allow anything that isn't pure Python.

You can even use CSS selectors with lxml, so it's far easier to use than BeautifulSoup. Try playing around with it in an interactive Python console.
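
A minimal sketch of the CSS-selector approach (newer lxml releases need the separate cssselect package for this; the HTML snippet is illustrative):

import lxml.html

html = '<div class="class1"><p class="class1 class2">text</p></div>'
root = lxml.html.fromstring(html)

# '.class1' matches any element whose space-separated class list contains class1:
for el in root.cssselect('.class1'):
    print(el.tag, el.get('class'))
# div class1
# p class1 class2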

Wahnfrieden
From lxml's own documentation: "While libxml2 (and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more forgiving and has superior support for encoding detection."
endolith
I've tried it and it is indeed nicer for this sort of thing.
endolith
Glad you like it. I hope you'll spread the word, too; lxml is an under-appreciated library. I think many overlook it because it has 'XML' in the name and its documentation isn't as approachable as BeautifulSoup's. BeautifulSoup has a certain charm, with its name and graphics, that makes it a little more attractive for superficial reasons.
Wahnfrieden
Yes, it isn't marketed as a scraper and I don't see enough examples of this kind of stuff in the docs.
endolith