I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "Owner Name(s)" value. What I have works, but it is really ugly and surely not the best approach, so I am looking for a better way. Here is what I have:

soup = BeautifulSoup(url_opener.open(url))            
x = soup('table', text = re.compile("Owner Name"))
print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next

The relevant HTML is

<td valign="top">
    <table border="1" cellpadding="1" cellspacing="0" align="right">
    <tbody><tr class="tableheaders">
    <td>Owner Name(s)</td>
    </tr>

    <tr>

    <td>PILCHER DONALD L                         </td>
    </tr>

    </tbody></table>
</td>

Wow, there are lots of questions about BeautifulSoup. I looked through them but didn't find an answer that helped me; hopefully this is not a duplicate question.

+1  A: 

This is a slight improvement, but I couldn't figure out how to get rid of the three parents.

x[0].parent.parent.parent.findAll('td')[1].string
Mark Byers
+2  A: 

(Edit: apparently the HTML the OP posted lies -- there is in fact no tbody tag to look for, even though he made a point of including it in that HTML. So I am changing the code to use table instead of tbody.)

As there may be several table-rows you want (e.g., see the sibling URL to the one you give, with the last digit, 4, changed into a 5), I suggest a loop such as the following:

# locate the table containing a cell with the given text
owner = re.compile('Owner Name')
cell = soup.find(text=owner).parent
while cell.name != 'table': cell = cell.parent
# print all non-empty strings in the table (except for the given text)
for x in cell.findAll(text=lambda x: x.strip() and not owner.match(x)):
  print x

This is reasonably robust to minor changes in page structure: having located the cell of interest, it walks up through its parents until it finds the table tag, then loops over all navigable strings within that table that aren't empty (or just whitespace), excluding the owner header.
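For readers on Python 3, here is a minimal sketch of the same idea, assuming the newer `bs4` package (not the BeautifulSoup 3 API used above) and its built-in `html.parser` backend. The function name `strings_in_table_with`, the `SAMPLE_HTML` constant, and the second owner row are made up for illustration; the real page may well be laid out differently.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modelled on the question's snippet (without tbody).
# The "DOE JANE Q" row is invented, to show several rows being returned.
SAMPLE_HTML = """
<td valign="top">
  <table border="1" cellpadding="1" cellspacing="0" align="right">
    <tr class="tableheaders"><td>Owner Name(s)</td></tr>
    <tr><td>PILCHER DONALD L                         </td></tr>
    <tr><td>DOE JANE Q</td></tr>
  </table>
</td>
"""

def strings_in_table_with(html, header_pattern):
    """Return the non-empty strings of the table that contains text
    matching header_pattern, leaving out the matching header itself."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the first string matching the compiled pattern.
    header = soup.find(string=header_pattern)
    if header is None:
        return []
    # Walk up to the nearest enclosing <table>, however deep the string is.
    table = header.find_parent("table")
    if table is None:
        return []
    # stripped_strings skips whitespace-only strings and strips the rest.
    return [s for s in table.stripped_strings if not header_pattern.search(s)]

owners = strings_in_table_with(SAMPLE_HTML, re.compile("Owner Name"))
print(owners)
```

Here `find_parent("table")` stands in for the explicit `while` loop, and returning an empty list when nothing matches avoids the attribute error you would otherwise get from calling methods on `None`.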

Alex Martelli
Thanks for the answer. I get an error: cell.name has no attribute name. I guess I can use a try, but I'm not really familiar with using try. Is there a better way to address this?
Vincent
The URL you gave has no such error w/my code (that's why I have the `.parent` in the 2nd line of my code: to move up from the navigable string, to a tag, which _does_ have a name). What exact URL has such a problem with the code I posted in my answer?
Alex Martelli
I just checked this URL, and there is no `<tbody>` tag. I think you'll just have to look for the "Owner Name(s)" table column header, and then read the values in all rows of that table.
Paul McGuire
Like Paul said, there is no tbody; the URL I am using is the one posted. I guess the solution that would make the most sense to me is to be able to find a table based on some content, then select the item in the table I want. (soup(Find a table that has "owner name"))
Vincent
@Vincent, so why do you show as "the relevant HTML" one **with** `tbody`? Ah well, just use `table` instead of `tbody` in the third line. Here, let me edit the answer to show that trivial change.
Alex Martelli
@vincent, there -- edited and added comment to show how the first three lines do **exactly** "find a table based on some content", the next two emit the (**plural** of course!-) other item**s** (strings) in that table. Not sure what you mean by "select" (?) and by using the singular, any more than I have any idea about why you showed a tbody tag that just wasn't there -- ah well!-)
Alex Martelli
Duh, clearly there is a tbody, sorry about that. It still doesn't work for me, but that might be my problem. I'd like to accept your answer, so I will try more. I am going to post an answer I got at the Beautiful Soup group; although I like that answer, I did not get it as promptly as you answered. Thanks again.
Vincent
@vincent: Clearly there is *NOT* a tbody in the HTML obtained by reading that URL.
John Machin
Not sure what the deal is with the tbody. I assure you I did not type the "relevant HTML" by hand; I copied and pasted it. Thanks for the time you spent on this, Alex Martelli.
Vincent
+3  A: 

This is Aaron DeVore's answer from the BeautifulSoup discussion group. It works well for me.

soup = BeautifulSoup(...)
label = soup.find(text="Owner Name(s)")

# Needs Tag.string to get to the actual name string

name = label.findNext('td').string

If you're doing a bunch of them, you can even go for a list comprehension.

names = [unicode(label.findNext('td').string) for label in
         soup.findAll(text="Owner Name(s)")]
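
In case it helps anyone on Python 3, below is a rough sketch of the same approach, assuming the newer `bs4` package, where `findNext` and `findAll` are spelled `find_next` and `find_all`. The helper name `owner_names` and the `SAMPLE_HTML` string are made up for illustration, and the sample is a trimmed copy of the snippet from the question, not the live page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (no tbody).
SAMPLE_HTML = """
<table>
  <tr class="tableheaders"><td>Owner Name(s)</td></tr>
  <tr><td>PILCHER DONALD L                         </td></tr>
</table>
"""

def owner_names(html, label_text="Owner Name(s)"):
    """Return the text of the first <td> that follows each label string."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    # string= with a plain str matches strings exactly equal to label_text.
    for label in soup.find_all(string=label_text):
        # find_next searches forward in document order from the label.
        cell = label.find_next("td")
        if cell is not None:
            # strip=True removes the trailing padding inside the cell.
            names.append(cell.get_text(strip=True))
    return names

print(owner_names(SAMPLE_HTML))
```

Note that this picks up only the first cell after each label, so a table listing several owners would need a loop over the remaining rows as well.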
Vincent