tags:

views:

93

answers:

2

I'm trying to read and handle a web-page in Python which has lines like the following in it:

              <div class="or_q_tagcloud" id="tag1611"></div></td></tr><tr><td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td><td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td><td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td><td class="or_q_tags_td">

I'm currently only interested in the artist name (AC/DC) and album name (Live). I can read and print them with libxml2dom but I can't figure out how I can distinguish between the links because the node value for every link is None.

One obvious way would be to read the input a line at a time, but is there a cleverer way of handling this HTML file so that I can create either two separate lists where each index matches the other, or a struct holding this info?

import urllib
import libxml2dom

def collect_text(node):
  "A function which collects text inside 'node', returning that text."

  s = ""
  for child_node in node.childNodes:
    if child_node.nodeType == child_node.TEXT_NODE:
      s += child_node.nodeValue
    else:
      s += collect_text(child_node)
  return s

# urllib.urlopen (Python 2) also accepts a local file path
f = urllib.urlopen("/home/x/Documents/rym_list.html")
s = f.read()
f.close()

doc = libxml2dom.parseString(s, html=1)

links = doc.getElementsByTagName("a")
for link in links:
  print "--\nNode ", link.childNodes
  # localName is the tag name ("a"), so this test never matches
  if link.localName == "artist":
    print "artist"
  print collect_text(link).encode('utf-8')
+2  A: 

Given the small snippet of HTML, I've no idea whether this would be effective on the full page, but here's how to extract 'AC/DC' and 'Live' using lxml.etree and XPath.

>>> from lxml import etree
>>> doc = etree.HTML("""<html>
... <head></head>
... <body>
... <tr>
... <td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
... <td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
... <td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td>
... <td class="or_q_tags_td">
... </tr>
... </body>
... </html>
... """)
>>> doc.xpath('//td[@class="or_q_artist"]/a/text()|//td[@class="or_q_album"]/a/text()')
['AC/DC', 'Live']
MattH
You can find the full file at http://rateyourmusic.com/collection_p/Makis/oo but you can't read it directly from that site, as they seem to block script access.
Makis
you can't read it directly because you need to be logged in to read it. In other words, unless you post your username and password, no one can read it. If you have any phish, you should post your username and password.
aaronasterling
Ouch, I didn't check that. You can view anyone's collection, but not open the printable page (which has all the albums on one page).
Makis
I copied the file to http://www.sofistes.net/public_html/rym_list.html .
Makis
The xpath in my answer works on `rym_list.html`, you'd just need to read two items at a time from the resulting sequence. Can do that easily with `grouper` from the `itertools` docs.
MattH
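A minimal sketch of the pairing step mentioned in the comment above, assuming the XPath result is a flat list that alternates artist, album, artist, album (that is, every row has both links). The `grouper` function follows the recipe in the `itertools` documentation; the `texts` list is a hypothetical stand-in for the XPath result, and its second row is made-up placeholder data. On Python 2, which the question's code uses, the import would be `izip_longest` instead of `zip_longest`.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length tuples, padding the last one if needed."""
    # The same iterator repeated n times, so each tuple consumes n items.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Hypothetical stand-in for the flat list returned by doc.xpath(...);
# the second artist/album pair is placeholder data.
texts = ['AC/DC', 'Live', 'Artist B', 'Album B']

# One (artist, album) tuple per row.
pairs = list(grouper(texts, 2))

# Two parallel sequences where each index matches the other.
artists, albums = zip(*pairs)

print(pairs)
print(artists)
print(albums)
```

If a row could be missing one of the two links, the alternation breaks and the pairs drift out of step, so selecting per table row would be safer in that case.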
A: 
  1. See if you can solve the problem in JavaScript, using jQuery-style DOM/CSS selectors to get at the elements/text that you want.
  2. If you can, then get a copy of BeautifulSoup for Python and you should be good to go in a matter of minutes.
dhruvbird
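A rough sketch of what the CSS-selector approach could look like, assuming the current `bs4` package (BeautifulSoup 4), which provides `select()`. The answer does not name a version, so this is an assumption. The HTML is a trimmed copy of the row from the question, wrapped in a `<table>` for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question, wrapped in <table> for illustration.
html = """<table><tr>
<td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
<td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
<td class="or_q_rating" id="rating374717">4.0</td>
</tr></table>"""

# "html.parser" is the parser bundled with Python's standard library.
soup = BeautifulSoup(html, "html.parser")

# CSS selectors: the link inside each artist cell and each album cell.
artists = [a.get_text() for a in soup.select("td.or_q_artist a")]
albums = [a.get_text() for a in soup.select("td.or_q_album a")]

# Pair the two lists by index; this assumes every row has both cells.
pairs = list(zip(artists, albums))
print(pairs)
```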