I'm using BeautifulSoup to parse a website:

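  import urllib2
  import BeautifulSoup
  # url holds the address of the page to parse (defined elsewhere)
  request = urllib2.Request(url)
  response = urllib2.urlopen(request)
  soup = BeautifulSoup.BeautifulSoup(response)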

I am using it to traverse a table. The problem is that BeautifulSoup inserts an extra closing table tag that doesn't exist in the original HTML, which I verified with print soup.prettify(). As a result, one of the td tags ends up outside the table and I can't select it.

+1  A: 

How about searching directly for each tag instead of trying to traverse into the table?

   for td in soup.findAll("td"):  # findAll returns every match; find returns only the first
        ...

It's not unusual to find a tbody tag nested within a table automatically when it's not in the source. You can either code for it or just jump straight to the tr or td tags.
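
To illustrate the idea of going straight to the td tags regardless of how the table tags nest, here is a minimal sketch. It swaps in Python 3's standard-library html.parser rather than BeautifulSoup, so it is not the code from the question; the names CellCollector and td_texts are made up for this example. It assumes cells are not nested inside each other.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, ignoring where <table> opens or closes."""

    def __init__(self):
        super().__init__()
        self.cells = []        # text of each finished cell
        self._current = None   # text fragments of the open cell, or None

    def _flush(self):
        # Store the open cell (if any) and mark that no cell is open.
        if self._current is not None:
            self.cells.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._flush()      # a new <td> implicitly ends an unclosed one
            self._current = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()          # keep a cell left open at end of input


def td_texts(html):
    """Return the text of every td cell found in the given HTML string."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells
```

Calling td_texts on a string where a td follows a stray closing table tag still returns that cell, because the collector never looks at the table tags at all.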

ebt
That's a good thought, and I tried that. When I run the code above, it returns the whole table, not each individual td. I think BeautifulSoup is breaking on this page's horrible HTML ... not sure what to do about it, though.
bababa
Two things: first, check the version you're using. If you're on 3.1, switch back to 3.0 (http://www.crummy.com/software/BeautifulSoup/3.1-problems.html). Otherwise, try lxml; IMHO it's a better general-purpose parser than BeautifulSoup.
ebt