ansaurus

Question

Separating HTML into groups using BeautifulSoup when groups are all in the same element

Answer 1

+2 A:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
""")

animals = []     # animal names, in document order
attributes = {}  # animal name -> list of attribute strings

for p in soup.findAll('p'):
    if p['class'] == 'animal':
        animals.append(p.string)
    elif p['class'] == 'attribute':
        # File each attribute under the most recently seen animal.
        attributes.setdefault(animals[-1], []).append(p.string)

print animals
print attributes

That should work.
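
For readers on Python 3, the code above will not run as written: it uses the Python 2 print statement and imports BeautifulSoup 3. Here is a rough sketch of the same grouping idea, assuming the `bs4` package (BeautifulSoup 4) is installed. `group_attributes` and `HTML` are names made up for this sketch. One difference to be aware of: in bs4, `p['class']` is a list rather than a string, so the sketch tests membership instead of equality.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 (bs4) is installed

HTML = """
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
"""


def group_attributes(html):
    """Map each animal to the attribute paragraphs that follow it."""
    # "html.parser" is the parser bundled with Python, so no extra install.
    soup = BeautifulSoup(html, "html.parser")
    groups = {}
    current = None  # most recently seen animal name
    for p in soup.find_all("p"):
        classes = p.get("class", [])  # bs4 returns the class attribute as a list
        text = p.get_text()
        if "animal" in classes:
            current = text
            groups.setdefault(current, [])
        elif "attribute" in classes and current is not None:
            # Attributes seen before any animal are skipped.
            groups[current].append(text)
    return groups


groups = group_attributes(HTML)
print(list(groups))
print(groups)
```

Calling `list(groups)` gives the animal names, so a separate `animals` list is only needed if you want it for other reasons.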

Jamie Wong 2010-06-26 16:38:24

Wouldn't that just get me the two animal elements? I don't need anything extra for that; you can just do: `soup.findAll('p', {'class': 'animal'})`

Acorn 2010-06-26 16:40:29

...actually, all you'd need is `soup.findAll('p', 'animal')`

Acorn 2010-06-26 16:46:34

I think I misinterpreted your question - are you trying to group the attributes by animal?

Jamie Wong 2010-06-26 16:48:55

Exactly. If each animal were, say, in a separate `<div>`, I could just iterate over the `<div>`s and group them easily. But when all the information is within the same element, I'm not sure how to keep each piece of data with the animal it relates to.

Acorn 2010-06-26 16:51:17

Now that I understand your question and have actually installed BeautifulSoup, check the answer.

Jamie Wong 2010-06-26 17:31:01

Answer 2

+2 A:

If you don't need to keep the animal names in order, you can simplify Jamie's answer like this:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
""")

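attributes = {}  # animal name -> list of attribute strings

for p in soup.findAll('p'):
    if p['class'] == 'animal':
        # Start a new group; later attributes attach to this animal.
        animal = p.string
        attributes[animal] = []
    elif p['class'] == 'attribute':
        attributes[animal].append(p.string)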

print attributes.keys()
print attributes
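
The ordering caveat matters on older Pythons, where a plain dict does not remember insertion order. On Python 3.7 and later, dicts keep insertion order, so the keys come back in document order. Below is a minimal sketch of the same loop, assuming that Python version. It works on hypothetical `(css_class, text)` pairs standing in for the parsed `<p>` elements, and `group_by_animal` is a made-up name. It also raises an error if an attribute shows up before any animal, which is an added guard and not part of the code above.

```python
def group_by_animal(paragraphs):
    """Group attribute texts under the most recently seen animal.

    `paragraphs` is a hypothetical list of (css_class, text) pairs that
    stands in for the parsed <p> elements.
    """
    attributes = {}
    animal = None  # no animal seen yet
    for css_class, text in paragraphs:
        if css_class == "animal":
            animal = text
            attributes[animal] = []
        elif css_class == "attribute":
            if animal is None:
                raise ValueError("attribute %r appears before any animal" % text)
            attributes[animal].append(text)
    return attributes


paragraphs = [
    ("animal", "cats"),
    ("attribute", "they meow"),
    ("attribute", "they have fur"),
    ("animal", "turtles"),
    ("attribute", "they don't make noises"),
    ("attribute", "they have shells"),
]

attributes = group_by_animal(paragraphs)
print(list(attributes))  # keys in insertion order on Python 3.7+
print(attributes)
```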

gnibbler 2010-06-26 17:50:07
