ansaurus

Question

Python web scraping involving HTML tags with attributes

Answer 1

+5 A:

It's not clear to me from your question why you need to worry about the div tags -- what about doing just:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; html holds the page source

soup = BeautifulSoup(html)
thetd = soup.find('td', attrs={'class': 'author'})
print thetd.string

On the HTML you give, running this emits exactly:

####I want whatever is located here ###

which appears to be what you want. Perhaps you can specify more exactly what you need that this super-simple snippet doesn't do: multiple td tags of class author that you need to consider (all? just some? which ones?), the possibility that no such tag is present (what do you want to do in that case?), and the like. It's hard to infer exactly what your specs are just from this simple example and overabundant code ;-).

Edit: if, as per the OP's latest comment, there are multiple such td tags, one per author:

thetds = soup.findAll('td', attrs={'class': 'author'})
for thetd in thetds:
    print thetd.string

...i.e., not much harder at all!-)
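The two open cases raised above (several matching cells, or none at all) can also be sketched without any third-party parser. This is not the BeautifulSoup approach from this answer but a swapped-in alternative based on the standard library's html.parser, written for Python 3. The names AuthorCellParser and author_cells and the sample markup are invented for illustration, since the question's real HTML is not reproduced here.

```python
from html.parser import HTMLParser


class AuthorCellParser(HTMLParser):
    """Collect the text of every <td class="author"> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.authors = []      # finished cell texts
        self._buffer = None    # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "author" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            self.authors.append("".join(self._buffer).strip())
            self._buffer = None


def author_cells(html):
    """Return a list of author cell texts; empty if no such cell exists."""
    parser = AuthorCellParser()
    parser.feed(html)
    parser.close()
    return parser.authors


# Invented sample markup, assumed for this sketch only.
sample = """
<table>
  <tr><td class="author">Alice</td><td class="date">2009-09-08</td></tr>
  <tr><td class='author highlighted'>Bob <b>Jr.</b></td></tr>
</table>
"""

print(author_cells(sample))                   # ['Alice', 'Bob Jr.']
print(author_cells("<p>no table here</p>"))   # []
```

Returning an empty list when nothing matches means the caller can loop over the result without a separate check for a missing tag. The sketch does not handle nested tables.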

Alex Martelli 2009-09-08 03:01:06

Thanks, Alex. I have multiple authors on the page, so I will be having multiple td tags. How do I iterate over each of them?

rohanbk 2009-09-08 03:21:42

Answer 2

A:

BeautifulSoup is certainly the canonical HTML parser/processor. But if you have just this kind of snippet you need to match, instead of building a whole hierarchical object representing the HTML, pyparsing makes it easy to define leading and trailing HTML tags as part of creating a larger search expression:

from pyparsing import makeHTMLTags, withAttribute, SkipTo

author_td, end_td = makeHTMLTags("td")

# only interested in <td>'s where class="author"
author_td.setParseAction(withAttribute(("class","author")))

search = author_td + SkipTo(end_td)("body") + end_td

for match in search.searchString(html):
    print match.body

Pyparsing's makeHTMLTags function does a lot more than just emit "<tag>" and "</tag>" expressions. It also handles:

caseless matching of tags
"<tag/>" syntax
zero or more attributes in the opening tag
attributes defined in arbitrary order
attribute names with namespaces
attribute values in single, double, or no quotes
intervening whitespace between tag and symbols, or attribute name, '=', and value
attributes are accessible after parsing as named results

These are the common pitfalls you run into when using a regex for HTML scraping.
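To make a few of those pitfalls concrete, here is a small standard-library illustration in Python 3. It does not use pyparsing. It only shows how a naive regex, of the kind one might write first, misses equivalent spellings of the same tag. The pattern and the sample strings are assumptions made up for this sketch.

```python
import re

# Naive pattern: assumes a lowercase tag, class as the only attribute,
# double quotes, and no extra whitespace.
naive = re.compile(r'<td class="author">(.*?)</td>')

# Invented variations of the same cell; browsers accept all of them.
variants = [
    '<td class="author">Alice</td>',             # the form the pattern expects
    '<TD CLASS="author">Alice</TD>',             # upper-case tag and attribute
    "<td class='author'>Alice</td>",             # single quotes
    '<td class=author>Alice</td>',               # no quotes
    '<td width="50" class="author">Alice</td>',  # another attribute first
    '<td class = "author" >Alice</td>',          # whitespace around '='
]

matched = [bool(naive.search(v)) for v in variants]
print(matched)  # [True, False, False, False, False, False]
```

Each failing variant can be patched into the regex one at a time, but the pattern grows quickly, and those are the cases the list above says makeHTMLTags covers.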

Paul McGuire 2009-09-08 03:31:52

Answer 3

A:

Or you could use pyquery, since BeautifulSoup is not actively maintained anymore; see http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

First, install pyquery with:

easy_install pyquery

Then your script could be as simple as:

from pyquery import PyQuery
d = PyQuery('http://mywebpage/')
# iterating a PyQuery selection yields lxml elements, so re-wrap each one
allauthors = [PyQuery(td).text() for td in d('td.author')]

pyquery uses the CSS selector syntax familiar from jQuery, which I find more intuitive than BeautifulSoup's API. It uses lxml underneath and is much faster than BeautifulSoup. But BeautifulSoup is pure Python, and thus works on Google App Engine as well.

captnswing 2010-05-02 07:01:44
