Dear all,

I am parsing an HTML form with Beautiful Soup. It has around 60 input fields, mostly radio buttons and checkboxes. So far this works with the following code:

from BeautifulSoup import BeautifulSoup

# keep the file handle separate from its contents so it can be closed
infile = open('myfile.html', 'r')
x = infile.read()
infile.close()
out = open('outfile.csv', 'w')
soup = BeautifulSoup(x)
values = soup.findAll('input', checked="checked")
# echoes some output like ('name',1) and ('value',4)

for cell in values:
    # the following line is my problem!
    statement = cell.attrs[0][1] + ';' + cell.attrs[1][1] + ';\r'
    out.write(statement)

out.close()

As indicated in the code, my problem is where the attributes are selected: the HTML template is ugly and mixes up the order of the attributes that belong to an input field. I am interested in name="somenumber" value="someothernumber". Unfortunately my attrs[1] approach does not work, since name and value do not always occur in the same order in my HTML.

Is there any way to access the resulting BeautifulSoup list associatively?

Thx in advance for any suggestions!

+2  A: 

I'm fairly sure you can use the attribute name like a key for a hash:

print cell['name']
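
For anyone who wants to try the idea without BeautifulSoup installed, here is a minimal sketch using Python 3's standard-library HTMLParser instead (a swapped-in parser, not the BeautifulSoup API from the question). It also hands over each tag's attributes as a list of (name, value) tuples, so the same lookup by name applies. The class name CheckedInputs and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class CheckedInputs(HTMLParser):
    """Collect (name, value) pairs for every checked <input> tag."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples in source order;
        # dict() makes them addressable by name instead of by position.
        attributes = dict(attrs)
        if tag == 'input' and 'checked' in attributes:
            self.rows.append((attributes.get('name'), attributes.get('value')))


# Hypothetical sample markup: the attribute order differs from tag to tag.
sample = (
    '<input type="radio" name="1" value="4" checked="checked">'
    '<input value="2" checked="checked" type="checkbox" name="7">'
    '<input type="radio" name="9" value="3">'
)

parser = CheckedInputs()
parser.feed(sample)
parser.close()

# Build the semicolon-separated lines the question writes to its output file.
lines = [name + ';' + value + ';' for name, value in parser.rows]
```

The unchecked third input is skipped, and the second one still comes out as name first, value second, even though its attributes appear in a different order in the markup.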
Peter
Hmm, can't accept both answers :). Thanks for the help. Basically, for cell in values: cell['name'] + ';' + cell['value'] did the job; no double loop necessary here.
ran2
+2  A: 

My suggestion is to make values a dict. If soup.findAll returns a list of tuples as you seem to imply, then it's as simple as:

values = dict(soup.findAll('input',checked="checked"))

After that you can simply refer to the values by their attribute name, like what Peter said.

Of course, if soup.findAll doesn't return a list of tuples as you've implied, or if your problem is that the tuples themselves are being returned in some weird way (such that instead of ('name', 1) it would be (1, 'name')), then it could be a bit more complicated.

On the other hand, if soup.findAll returns one of a certain set of data types (dict or list of dicts, namedtuple or list of namedtuples), then you'll actually be better off because you won't have to do any conversion in the first place.

...Yeah, after checking the BeautifulSoup documentation, it seems that findAll returns an object that can be treated like a list of dicts, so you can just do as Peter says.

http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags
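
To illustrate the list-of-tuples point in plain Python: assuming one tag's attributes arrive as (name, value) tuples, as the question describes, dict() turns them into a mapping where the order no longer matters. The attribute lists below are invented sample data, not output from BeautifulSoup.

```python
# Hypothetical attribute lists for two tags, in different orders.
first = [('name', '1'), ('value', '4'), ('checked', 'checked')]
second = [('checked', 'checked'), ('value', '2'), ('name', '7')]

rows = []
for pairs in (first, second):
    attributes = dict(pairs)  # lookup by key no longer depends on position
    rows.append(attributes['name'] + ';' + attributes['value'])

# If the tuples came back reversed, e.g. ('1', 'name'), swap each pair
# while building the dict.
swapped = [('4', 'value'), ('1', 'name')]
unswapped = dict((key, value) for value, key in swapped)
```

Note that this applies dict() to the attribute list of a single tag, not to the whole result of findAll.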

Oh yeah, if you want to enumerate through the attributes, just do something like this:

for cell in values:
    # cell.attrs is a list of (name, value) tuples; iterating cell itself yields child nodes
    for attribute, value in cell.attrs:
        out.write(attribute + ';' + str(value) + ';\r')
JAB
For me, one loop was enough. Please see my comment on Peter's answer.
ran2