Assuming I have html read into my program like this:

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T &amp; P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>

<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>

<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>

How do I grab the contents of the text node? What I would like to end up with is printing something similar to this line in the terminal:

http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT

So far I have the following code, which extracts the href link fine, but I'm not sure how to extract the link text itself. I'm thinking of overriding handle_data(self, data) from the sgmllib module, but so far I can't think of a way to do it.

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == "href"]
        if href:
            self.urls.extend(href)

Thanks!
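
A minimal sketch of the handle_data idea, for comparison. Note the assumptions: sgmllib was removed in Python 3, so this uses html.parser.HTMLParser from the Python 3 standard library instead of SGMLParser, and the class name LinkTextLister and the one-line sample are made up for illustration. The idea is to remember the href when an <a> opens, collect text in handle_data while it is open, and store the pair when it closes.

```python
from html.parser import HTMLParser


class LinkTextLister(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Keep text only while inside an <a> that has an href.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks).strip()))
            self._href = None


# One line taken from the HTML in the question.
sample = (
    '<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">'
    'TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>'
)

parser = LinkTextLister()
parser.feed(sample)
parser.close()

for href, text in parser.links:
    print(href, "-", text)
# http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT
```

Text outside the <a> element (the " - " and the <font> contents) is ignored because no href is pending when handle_data sees it.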

+3  A: 

Personally I would use lxml. Once installed, getting what you want is simple:

from lxml import html

tree = html.fromstring(open("data.html").read())

print [e.text_content() for e in tree.xpath("//a")]
John
+5  A: 

Simplest is probably BeautifulSoup (be sure to use 3.0.8 or higher 3.0.* release, not 3.1.*, unless you're on Python 3 -- see here!).

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(thehtmlstring)

for anchor in soup.findAll('a'):
  print anchor['href'], anchor.string

BeautifulSoup produces unicode strings -- if that's a problem, be sure to encode them as you wish to get the byte strings the way you want them!
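
To make the encoding remark concrete, here is a small sketch. The href and title values are made-up stand-ins for a link's href and text, not data from the question, and UTF-8 is just one possible choice of target encoding.

```python
# Hypothetical stand-ins for the unicode values pulled out of the parsed HTML.
href = u"http://example.com/job.html"
title = u"Caf\u00e9 Sales Associate"

line = u"%s - %s" % (href, title)

# Encode explicitly to get a byte string in the encoding you want.
utf8_line = line.encode("utf-8")

# For an ASCII-only sink, pick an error policy instead of letting it raise.
ascii_line = line.encode("ascii", "replace")

print(repr(utf8_line))
print(repr(ascii_line))
```

With "replace", characters that cannot be represented come out as "?"; "ignore" and "xmlcharrefreplace" are other standard error handlers.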

Alex Martelli
If you follow this, REALLY DON'T USE 3.1.* (I should read everything before diving into it) :)
Diego Castro
+2  A: 
sdolan
I like: `for k, v in attrs: if k == 'href': return v`
PreludeAndFugue
Good call. I updated my post.
sdolan
+1  A: 

As long as we're comparing options, this pyparsing snippet also gives you the location for each position, taken from the <font> tag that follows the closing </a> tag:

from pyparsing import makeHTMLTags, SkipTo

a,aEnd = makeHTMLTags("A")
font,fontEnd = makeHTMLTags("FONT")
p,pEnd = makeHTMLTags("P")

patt = (p + a("a") + SkipTo(aEnd)("posn") + aEnd + '-' + 
        font + SkipTo(fontEnd)("locn") + fontEnd + pEnd)

for tokens,_,_ in patt.scanString(the_html):
    print tokens.a.href, '-', tokens.posn, tokens.locn

Gives:

http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T &amp; P/T Sales Associate - Caliente Fashions (North Vancouver)
http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT (NORTH VANCOUVER)
http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position (New Westminster)
http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk (Kits)
http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES (VANCOUVER ( KITS ))
http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate (Vancouver)
http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere (Langley Centre)
http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT (Burnaby South)
http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE (South Surrey-Semiahmoo)
http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales (Coquitlam)
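
The first output line above still shows the raw &amp; entity, since the text is printed as it appears in the HTML source. If decoded text is wanted, one option (a sketch that assumes Python 3.4 or later, where html.unescape is in the standard library) is to pass the captured string through it:

```python
from html import unescape

# Title copied from the first output line above.
raw_title = "F/T &amp; P/T Sales Associate - Caliente Fashions"

decoded_title = unescape(raw_title)
print(decoded_title)
# F/T & P/T Sales Associate - Caliente Fashions
```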
Paul McGuire