ansaurus

Question

Answer 1

+3 A:

Personally I would use lxml. Once installed, getting what you want is simple:

from lxml import html

tree = html.fromstring(open("data.html").read())

print [e.text_content() for e in tree.xpath("//a")]

John 2010-06-29 22:25:05

Answer 2

+5 A:

Simplest is probably BeautifulSoup (be sure to use 3.0.8 or a later 3.0.* release, not 3.1.*, unless you're on Python 3 -- see here!).

import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(thehtmlstring)

for anchor in soup.findAll('a'):
  print anchor['href'], anchor.string

BeautifulSoup produces unicode strings -- if that's a problem, be sure to encode them to get byte strings in the encoding you want!
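A minimal sketch of that encoding step, written for Python 3 (where `str` plays the role of Python 2's `unicode`) and using only the standard library. The sample string is a made-up stand-in for a value such as `anchor.string`, and the choice of UTF-8 and of the `xmlcharrefreplace` error handler is an assumption; use whatever encoding your output needs.

```python
# Hypothetical anchor text standing in for a value such as anchor.string.
text = "Caf\u00e9 Manager"

# Encode to UTF-8 bytes: every character is representable.
utf8_bytes = text.encode("utf-8")

# Encode to ASCII bytes, replacing anything non-ASCII with an
# XML/HTML numeric character reference instead of raising an error.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

print(utf8_bytes)
print(ascii_bytes)
```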

Alex Martelli 2010-06-29 22:31:10

If you're following this, REALLY DON'T USE 3.1.* (I should have read everything before diving into it) :)

Diego Castro 2010-10-30 13:21:45

Answer 3

+2 A:

sdolan 2010-06-29 22:39:27

I like: `for k, v in attrs: if k == 'href': return v`

PreludeAndFugue 2010-06-29 22:43:22

Good call. I updated my post.

sdolan 2010-06-29 22:49:02
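The `attrs` loop quoted in the comment above matches how the standard library's `HTMLParser` hands attributes to `handle_starttag`, as a list of `(name, value)` pairs. The body of this answer is not preserved here, so the following is only a sketch of that general approach, not the answerer's code. It assumes Python 3 (`html.parser`), and the names `AnchorCollector` and `anchor_contents` are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, text) pairs
        self._in_anchor = False  # True while inside an open <a>
        self._href = None        # href of the <a> currently open
        self._text = []          # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            # attrs is a list of (name, value) pairs; href may be absent.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


def anchor_contents(markup):
    """Return a list of (href, text) pairs for the <a> tags in markup."""
    parser = AnchorCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

Calling `anchor_contents(the_html)` on the page source would then give one `(href, text)` tuple per link. Unlike lxml or BeautifulSoup, this sketch does nothing special for badly nested markup.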

Answer 4

+1 A:

As long as we're comparing options, this pyparsing snippet also gives you the location for each position, given in the <font> tag following the closing <a> tag:

from pyparsing import makeHTMLTags, SkipTo

a,aEnd = makeHTMLTags("A")
font,fontEnd = makeHTMLTags("FONT")
p,pEnd = makeHTMLTags("P")

patt = (p + a("a") + SkipTo(aEnd)("posn") + aEnd + '-' + 
        font + SkipTo(fontEnd)("locn") + fontEnd + pEnd)

for tokens,_,_ in patt.scanString(the_html):
    print tokens.a.href, '-', tokens.posn, tokens.locn

Gives:

http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T &amp; P/T Sales Associate - Caliente Fashions (North Vancouver)
http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT (NORTH VANCOUVER)
http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position (New Westminster)
http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk (Kits)
http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES (VANCOUVER ( KITS ))
http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate (Vancouver)
http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere (Langley Centre)
http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT (Burnaby South)
http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE (South Surrey-Semiahmoo)
http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales (Coquitlam)

Paul McGuire 2010-06-29 23:05:57


get contents of <a> tags using python
