Here's an example:

<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>

If each animal were in a separate element, I could just iterate over the elements. That would be great. But the website I'm trying to parse has all the information in one element.

What would be the best way either to separate the soup into different animals, or to extract the attributes in some other way along with the animal each one belongs to?

(feel free to recommend a better title)

+2  A: 

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
""")

animals = []
attributes = {}

for p in soup.findAll('p'):
    if p['class'] == 'animal':
        # A new animal starts a new group.
        animals.append(p.string)
    elif p['class'] == 'attribute':
        # An attribute belongs to the most recently seen animal.
        attributes.setdefault(animals[-1], []).append(p.string)

print animals
print attributes

That should work.
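
The code above uses Python 2 syntax. On Python 3, the same grouping can be sketched using only the standard library's `html.parser`. This is a minimal sketch rather than a drop-in replacement: `AnimalParser` is a hypothetical helper name, and it assumes the markup is as flat as in the example (no tags nested inside the `<p>` elements).

```python
from html.parser import HTMLParser

HTML = """
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
"""


class AnimalParser(HTMLParser):
    """Hypothetical helper: files each 'attribute' under the preceding 'animal'."""

    def __init__(self):
        super().__init__()
        self.animals = []            # animal names in document order
        self.attributes = {}         # animal name -> list of attribute strings
        self._current_class = None   # class of the <p> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._current_class = dict(attrs).get('class')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._current_class = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip the whitespace between paragraphs
        if self._current_class == 'animal':
            self.animals.append(text)
            self.attributes.setdefault(text, [])
        elif self._current_class == 'attribute' and self.animals:
            # Attach to the most recently seen animal.
            self.attributes[self.animals[-1]].append(text)


parser = AnimalParser()
parser.feed(HTML)
parser.close()

print(parser.animals)
print(parser.attributes)
```

The `and self.animals` guard means an attribute that appears before any animal is ignored instead of raising an `IndexError`.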

Jamie Wong
Would that not just get me the two animal elements? I don't need anything extra to do that, you can just do: `soup.findAll('p', {'class': 'animal'})`
Acorn
.. actually all you'd need to do is `soup.findAll('p', 'animal')`
Acorn
I think I misinterpreted your question - are you trying to group the attributes by animal?
Jamie Wong
Exactly. If each animal was say, in a separate `<div>`, then I could just iterate over the `<div>`s and I could easily group them. But when all the information is within the same element, I'm not sure how I can keep all the data with the animal it is related to.
Acorn
Now that I understand your question and have actually installed BeautifulSoup - check the updated answer
Jamie Wong
+2  A: 

If you don't need to keep the animal names in order, you can simplify Jamie's answer like this:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
""")

attributes = {}

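for p in soup.findAll('p'):
    if p['class'] == 'animal':
        # Remember the current animal and start an empty list for it.
        animal = p.string
        attributes[animal] = []
    elif p['class'] == 'attribute':
        # Attributes go to whichever animal was seen last.
        attributes[animal].append(p.string)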

print attributes.keys()
print attributes
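
On the ordering caveat: a plain `dict` only guarantees insertion order from Python 3.7 onward. Where that is not available, `collections.OrderedDict` keeps the animals in document order without a separate list. A minimal sketch, using a hypothetical `paragraphs` list of (class, text) pairs as a stand-in for the parsed `<p>` elements:

```python
from collections import OrderedDict

# Stand-in for the parsed <p> elements: (class, text) pairs in document order.
paragraphs = [
    ('animal', 'cats'),
    ('attribute', 'they meow'),
    ('attribute', 'they have fur'),
    ('animal', 'turtles'),
    ('attribute', "they don't make noises"),
    ('attribute', 'they have shells'),
]

attributes = OrderedDict()  # animal name -> list of attributes, in first-seen order
animal = None

for css_class, text in paragraphs:
    if css_class == 'animal':
        animal = text
        attributes[animal] = []
    elif css_class == 'attribute' and animal is not None:
        attributes[animal].append(text)

print(list(attributes.keys()))
```

The loop is the same as above; only the container changes, so the keys come back in the order the animals appeared.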
gnibbler