ansaurus

Question

Answer 1

+2 A:

Given only this small snippet of HTML, I can't say whether this will work on the full page, but here's how to extract 'AC/DC' and 'Live' using lxml.etree and XPath.

>>> from lxml import etree
>>> doc = etree.HTML("""<html>
... <head></head>
... <body>
... <tr>
... <td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
... <td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
... <td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td>
... <td class="or_q_tags_td">
... </tr>
... </body>
... </html>
... """)
>>> doc.xpath('//td[@class="or_q_artist"]/a/text()|//td[@class="or_q_album"]/a/text()')
['AC/DC', 'Live']

MattH 2010-08-09 16:19:45

The full file is at http://rateyourmusic.com/collection_p/Makis/oo, but you can't read it directly from that site, as they seem to block script access.

Makis 2010-08-09 19:00:44

You can't read it directly because you need to be logged in to read it. In other words, unless you post your username and password, no one can read it. If you have any phish, you should post your username and password.

aaronasterling 2010-08-09 19:33:14

Ouch, I didn't check that. You can view anyone's collection, but you can't open the printable page (which has all the albums on one page) without logging in.

Makis 2010-08-10 17:31:23

I copied the file to http://www.sofistes.net/public_html/rym_list.html .

Makis 2010-08-10 17:37:37

The XPath in my answer works on `rym_list.html`; you'd just need to read two items at a time from the resulting sequence. You can do that easily with the `grouper` recipe from the `itertools` docs.

MattH 2010-08-10 17:56:22
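
For readers who want to see the pairing step spelled out, here is a minimal sketch in Python 3. `grouper` is adapted from the recipes section of the `itertools` documentation. The list `flat` stands in for the XPath result, and its second artist/album pair is a made-up placeholder, not data from the page. The sketch assumes every row contains both an artist link and an album link; if one is missing, the pairs would drift out of alignment.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length tuples (adapted from the itertools recipes)."""
    # The same iterator is repeated n times, so zip_longest pulls
    # n consecutive items from it for each output tuple.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Stand-in for the flat list returned by doc.xpath(...).
# The second pair is a hypothetical placeholder.
flat = ['AC/DC', 'Live', 'Example Artist', 'Example Album']

pairs = list(grouper(flat, 2))
for artist, album in pairs:
    print(f"{artist} - {album}")
```

If the list has an odd number of items, the last tuple is padded with `fillvalue`, which is a useful signal that a row was missing one of its links.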

Answer 2

A:

See if you can solve the problem in JavaScript, using jQuery-style DOM/CSS selectors to get at the elements and text that you want.
If you can, get a copy of BeautifulSoup for Python and you should be good to go in a matter of minutes.

dhruvbird 2010-08-09 20:15:47
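
As a rough illustration of the selector approach suggested above, here is a sketch using the `select` method of BeautifulSoup 4 (the `bs4` package), which accepts CSS selectors. The HTML is a trimmed copy of the row from the first answer, wrapped in a `<table>` for tidiness; that wrapper and the variable names are assumptions for the example, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row quoted in the first answer; the <table>
# wrapper is added here only to keep the fragment well-formed.
html = """<table><tr>
<td class="or_q_artist"><a href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
<td class="or_q_album"><a href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors, much like jQuery's $("td.or_q_artist a")
artists = [a.get_text() for a in soup.select("td.or_q_artist a")]
albums = [a.get_text() for a in soup.select("td.or_q_album a")]

# Pair them up; this assumes each row has exactly one artist and one album.
collection = list(zip(artists, albums))
print(collection)
```

Selecting per row (for example, iterating over `soup.select("tr")` and querying inside each row) would be more robust if some rows lack one of the two links.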

Reading web pages with Python
