Hi,

I was wondering how I would extract the value of an HTML element using a regular expression (preferably in Python).

For example, <a href="http://google.com"> Hello World! </a>

What regex would I use to extract Hello World! from the above HTML?

Thanks in advance, James Eggers.

A: 

Ideally you wouldn't use a regular expression: they are unsuitable for most parsing tasks, including HTML. Use a parsing library. I'm not an expert Python user, but I'm sure there's one to be had.
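
For what it's worth, Python's standard library does ship one: `html.parser`. Below is a minimal sketch of pulling the text out of every `<a>` element with it. The class name `AnchorTextParser` and the helper `anchor_texts` are made up for illustration, and the sketch assumes anchors are not nested inside each other.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the text inside every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.texts = []     # one stripped string per <a> element seen
        self._in_a = False  # True while we are between <a> and </a>
        self._chunks = []   # text pieces of the <a> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self._in_a = False
            self.texts.append("".join(self._chunks).strip())

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_a:
            self._chunks.append(data)


def anchor_texts(html):
    """Return the text of each <a> element in the given HTML string."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


print(anchor_texts('<a href="http://google.com"> Hello World! </a>'))
```

Because the parser keeps a list, the same call also handles a page with several links, returning one entry per `<a>` element.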

Eamon Nerbonne
+8  A: 

Using regex to parse HTML has been covered extensively on SO. The consensus is that it shouldn't be done.

Here are some related links worth reading:

One trick I have used in the past to parse HTML files is to convert them to XHTML, treat the result as an XML file, and use XPath. If this is an option, look at:
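
A rough sketch of the XPath half of that idea in Python, assuming the markup has already been converted to well-formed XHTML (the conversion step is not shown here). It uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The function name `link_texts` and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def link_texts(xhtml):
    """Return (text, href) pairs for every <a> element in well-formed XHTML."""
    root = ET.fromstring(xhtml)
    pairs = []
    # ".//a" is an XPath expression: every <a> element at any depth below root.
    for a in root.findall(".//a"):
        # itertext() also picks up text inside child elements such as <b>.
        text = "".join(a.itertext()).strip()
        pairs.append((text, a.get("href")))
    return pairs


# Assumed sample input, wrapped in a <div> so there is a single root element.
sample = (
    '<div>'
    '<a href="http://google.com"> Hello World! </a>'
    '<a href="http://example.com">Second <b>link</b></a>'
    '</div>'
)

for text, href in link_texts(sample):
    print(text, "->", href)
```

Note that a document declaring the XHTML namespace would need namespace-qualified tag names in the query, and input that is not well-formed XML makes `ET.fromstring` raise a `ParseError`.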

Abe Miessler
+6  A: 

Regex + HTML...

But BeautifulSoup is a handy library.

>>> from BeautifulSoup import BeautifulSoup
>>> html = '<a href="http://google.com"> Hello World! </a>'
>>> soup = BeautifulSoup(html)
>>> soup.a.string
u' Hello World! '

This, for instance, would print out links on this page:

import urllib2
from BeautifulSoup import BeautifulSoup

q = urllib2.urlopen('http://stackoverflow.com/questions/3884419/')
soup = BeautifulSoup(q.read())

for link in soup.findAll('a'):
    if link.has_key('href'):
        print str(link.string) + " -> " + link['href']
    elif link.has_key('id'):
        print "ID: " + link['id']
    else:
        print "???"

Output:

Stack Exchange -> http://stackexchange.com
log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
careers -> http://careers.stackoverflow.com
meta -> http://meta.stackoverflow.com
...
ID: flag-post-3884419
None -> /posts/3884419/revisions
...
Nick T
If I had multiple links (<a href=""> blah blah </a>), that only seems to output the first link it comes across?
James Eggers
There are other methods. `soup.findAll('a')` for instance. See the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html
Manoj Govindan
I keep hearing about BeautifulSoup but I didn't realize it actually had such a nice API... there are so many tools out there, but a lot of them are just atrocious to use. This is nice :) I've been doing my parsing in C# though.
Mark