Hey again all,

I have the following script so far:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import urllib2

br = Browser()
br.open("http://www.foo.com")

html = br.response().read()

soup = BeautifulSoup(html)
items = soup.findAll(id="info")

It runs perfectly and results in the following "items":

<div id="info">
<span class="customer"><b>John Doe</b></span><br>
123 Main Street<br>
Phone:5551234<br>
<b><span class="paid">YES</span></b>
</div>

However, I'd like to take items and clean it up to get just the text:

John Doe
123 Main Street
5551234

How can you remove such tags in BeautifulSoup and Python?

As always, thanks!

A: 

This will do it for this EXACT HTML. It relies on fixed positions in the div's contents, so it isn't tolerant of any deviation, and you'll want to add quite a lot of bounds checking and null checking. But here are the nuts and bolts to get your data into plain text.

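items = soup.findAll(id="info")
# contents alternates text nodes and tags: 0 "\n", 1 <span>, 2 <br>, 3 address, 4 <br>, 5 phone
print items[0].span.b.contents[0]  # customer name inside <span><b>
print items[0].contents[3].strip()  # address text node
print items[0].contents[5].strip().split(":", 1)[1]  # text after "Phone:"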
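
If you want something a little more tolerant, here is a rough sketch of the same idea. It assumes Python 3 and the newer bs4 package rather than the BeautifulSoup 3 import used in the question, and the name extract_info is made up for illustration. Instead of relying on fixed positions in contents, it drops the "paid" span and collects whatever text is left.

```python
# Sketch only: assumes Python 3 with the bs4 package installed.
from bs4 import BeautifulSoup

HTML = """<div id="info">
<span class="customer"><b>John Doe</b></span><br>
123 Main Street<br>
Phone:5551234<br>
<b><span class="paid">YES</span></b>
</div>"""


def extract_info(html):
    """Return the text lines inside the element with id="info" (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find(id="info")
    if div is None:
        # No matching element: return an empty list instead of raising.
        return []

    # Remove the "paid" flag so its text is not collected below.
    for paid in div.find_all("span", class_="paid"):
        paid.decompose()

    lines = []
    # stripped_strings yields each text node with whitespace trimmed,
    # skipping nodes that are empty after trimming.
    for text in div.stripped_strings:
        if text.startswith("Phone:"):
            # Keep only the part after the "Phone:" label.
            text = text.split(":", 1)[1].strip()
        lines.append(text)
    return lines


if __name__ == "__main__":
    for line in extract_info(HTML):
        print(line)
```

This still assumes the phone line starts with "Phone:" and that everything except the "paid" span is wanted, so adjust those checks to match your real pages.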
Peter Lyons
Thanks, Peter, this is exactly what I needed!
Parker