views: 852
answers: 2

If a page has <div class="class1"> and <p class="class1">, then soup.findAll(True, 'class1') will find them both.

If it has <p class="class1 class2">, though, that element will not be found. How do I find all elements with a given class, regardless of whether they also have other classes?
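
To make the problem concrete, here is a minimal sketch (BeautifulSoup 3 API; the HTML snippet is illustrative):

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

html = '<div class="class1"></div><p class="class1"></p><p class="class1 class2"></p>'
soup = BeautifulSoup(html)

# A plain string matches only when the class attribute is exactly 'class1',
# so the multi-class <p> is missed:
print(len(soup.findAll(True, 'class1')))  # prints 2, not 3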

+4  A: 

Unfortunately, BeautifulSoup treats this as a single class whose name contains a space ('class1 class2') rather than as two classes (['class1', 'class2']). A workaround is to search for the class with a regular expression instead of a plain string.

This works:

import re
soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
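
For context, a minimal self-contained sketch of the workaround (BeautifulSoup 3 API; the HTML snippet is illustrative):

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

html = '<div class="class1"></div><p class="class1 class2"></p>'
soup = BeautifulSoup(html)

# \bclass1\b matches 'class1' as a whole word anywhere in the class attribute,
# so the multi-class <p> is found as well:
matches = soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
print([tag.name for tag in matches])  # ['div', 'p']

For what it's worth, the newer bs4 package treats class as a multi-valued attribute, so soup.find_all(True, class_='class1') matches all such elements without needing a regex.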
endolith
https://bugs.launchpad.net/bugs/410304
endolith
+3  A: 

You should use lxml. Its CSS selectors handle multiple space-separated class values ('class1 class2') correctly.

Despite its name, lxml is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (which is BeautifulSoup's claim to fame). It also has a BeautifulSoup-compatible API if you don't want to learn the lxml API.

Ian Bicking agrees and prefers lxml over BeautifulSoup. (He's a well-known Python developer, the creator of pip and virtualenv.)

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or somewhere else that doesn't allow anything that isn't pure Python.

You can even use CSS selectors with lxml, so it's far easier to use than BeautifulSoup. Try playing around with it in an interactive Python console.
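
A minimal sketch of the CSS-selector approach (newer lxml releases need the separate cssselect package for this; the HTML snippet is illustrative):

import lxml.html

html = '<div class="class1"><p class="class1 class2">text</p></div>'
root = lxml.html.fromstring(html)

# '.class1' matches any element whose space-separated class list contains class1:
for el in root.cssselect('.class1'):
    print(el.tag, el.get('class'))
# div class1
# p class1 class2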

Wahnfrieden
From lxml's own documentation: "While libxml2 (and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more forgiving and has superior support for encoding detection."
endolith
I've tried it and it is indeed nicer for this sort of thing.
endolith
Glad you like it. I hope you'll spread the word, too; lxml is an under-appreciated library. I think many overlook it because it has 'XML' in the name and its documentation isn't as approachable as BeautifulSoup's. BeautifulSoup has a certain charm, with its name and graphics, that makes it a little more attractive for superficial reasons.
Wahnfrieden
Yes, it isn't marketed as a scraper and I don't see enough examples of this kind of stuff in the docs.
endolith