ansaurus

Question

Beautiful Soup and extracting a div and its contents by ID

Answer 1

+2 A:

You should post your example document, because the code works fine:

>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div id="articlebody"> ... </div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>

Finding <div>s inside <div>s works as well:

>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div><div id="articlebody"> ... </div></div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>

Lukáš Lalinský 2010-01-25 22:55:30

My example document is enormous. I'm tracking down the problem; I think this doesn't work on divs inside divs. I counted the divs in the document with print len(soup('div')), which gave 10, and I can clearly see more than 10 divs with Firebug. So I think it just can't find divs inside divs, and I need to narrow things down wrapper by wrapper.

hatorade 2010-01-25 22:59:11

Well, then it's impossible to answer your question. Crystal balls are not a reliable way of debugging. :)

Lukáš Lalinský 2010-01-25 23:00:33

Answer 2

A:

Have you tried soup.findAll("div", {"id": "articlebody"})?

It sounds crazy, but if you're scraping stuff from the wild, you can't rule out multiple divs with the same id.
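
To illustrate the duplicate-id case, here is a minimal sketch. It assumes Python 3 and the bs4 package (the successor to the BeautifulSoup 3 module used elsewhere in this thread), where findAll is spelled find_all. The markup is a made-up stand-in, not the asker's page.

```python
from bs4 import BeautifulSoup  # assumption: bs4, not the BeautifulSoup 3 module

# Hypothetical page that reuses the same id twice, once inside a wrapper div.
html = """
<html><body>
  <div id="articlebody">first</div>
  <div><div id="articlebody">second</div></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", {"id": "articlebody"})        # only the first match
matches = soup.find_all("div", {"id": "articlebody"})  # every match, nested or not

print(first.get_text())
print(len(matches))
```

If the list holds more than one element, find may be returning a different div from the one you expected.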

2010-01-25 23:00:55

Answer 3

A:

I used:

soup.findAll('tag', attrs={'attrname':"attrvalue"})

That is the syntax I use for find/findAll. That said, unless there are other optional parameters between the tag name and the attribute dictionary, it shouldn't behave differently from the positional form.

Ninefingers 2010-01-25 23:02:37

Answer 4

A:

Here is a code fragment:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(open("index.html"))  # parse a local HTML file
titleList = soup.findAll('title')
divList = soup.findAll('div', attrs={"class": "article story"})

As you can see, I first find all title tags and then all div tags with class="article story".

Recursion 2010-01-25 23:03:03

Answer 5

A:

In the BeautifulSoup source, this line allows divs to be nested within divs, so the concern you raised in your comment on Lukáš's answer shouldn't be valid.

NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']

What I think you need to do is specify the attrs you want, such as:

source.find('div', attrs={'id':'articlebody'})

dagoof 2010-01-25 23:05:25

Answer 6

A:

I think there is a problem when the div tags are nested too deeply. I am trying to parse some contacts from a Facebook HTML file, and Beautiful Soup is not able to find div tags with class "fcontent".

This happens with other classes as well. When I search for divs in general, it returns only those that are not so deeply nested.

The HTML source can be any Facebook page showing the friends list of a friend of yours (not your own friends list). If someone can test it and give some advice, I would really appreciate it.

This is my code, where I just try to print the number of div tags with class "fcontent":

from BeautifulSoup import BeautifulSoup

f = open('/Users/myUserName/Desktop/contacts.html')
soup = BeautifulSoup(f)
# Named fcontent_divs rather than "list" to avoid shadowing the built-in type.
fcontent_divs = soup.findAll('div', attrs={'class': 'fcontent'})
print len(fcontent_divs)
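
One rough way to check whether tags are really being dropped is to count the divs independently of Beautiful Soup. The sketch below uses Python 3's standard-library html.parser instead of Beautiful Soup. The DivCounter class and the sample markup are hypothetical stand-ins for the contents of contacts.html.

```python
from html.parser import HTMLParser


class DivCounter(HTMLParser):
    """Counts <div> start tags and tracks how deeply they nest."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.total = 0        # every <div> start tag seen
        self.with_class = 0   # <div> tags carrying target_class
        self.depth = 0        # current div nesting depth
        self.max_depth = 0    # deepest div nesting seen so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        self.total += 1
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        # A class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.with_class += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


# Hypothetical markup; in practice, read the saved HTML file instead.
html = ('<div><div><div class="fcontent">a</div></div>'
        '<div class="fcontent x">b</div></div>')

counter = DivCounter("fcontent")
counter.feed(html)
counter.close()
print(counter.total, counter.with_class, counter.max_depth)
```

If these numbers agree with what Beautiful Soup reports, the missing divs are probably not in the saved file at all. If they differ, the parser is the more likely culprit.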

omar 2010-03-04 03:34:24
