ansaurus

Question

How do I use regular expressions to parse HTML tags?

Answer 1

A:

Ideally you wouldn't use a regular expression: they are unsuitable for most parsing tasks, including HTML. Use a parsing library instead. I'm not an expert Python user, but I'm sure there's one to be had.
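To make the point concrete, here is a minimal sketch comparing a naive regex with Python's standard-library `html.parser`. The sample markup, the pattern, and the names `LinkCollector`, `regex_link_texts` and `parsed_links` are made up for illustration; they are not from any answer here. The sample puts a `>` inside an attribute value, which is enough to confuse the pattern but not the parser.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: the title attribute contains a '>' character.
SAMPLE = '<p>See <a title="a > b" href="http://example.com/">this link</a>.</p>'

# A naive pattern: "<a", attributes up to the first '>', then the link text.
NAIVE_LINK = re.compile(r'<a\s[^>]*>(.*?)</a>')


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a> and </a>.
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text)))
            self._in_link = False


def regex_link_texts(html):
    """Link texts according to the naive regex."""
    return NAIVE_LINK.findall(html)


def parsed_links(html):
    """(href, text) pairs according to the HTML parser."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(regex_link_texts(SAMPLE))  # the regex stops at the '>' inside title
    print(parsed_links(SAMPLE))
```

The regex treats the `>` inside `title` as the end of the tag, so its captured "text" includes leftover attribute markup. The parser reads the quoted value as a single attribute and reports the real href and text.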

Eamon Nerbonne 2010-10-07 17:58:02

Answer 2

+8 A:

Using regex to parse HTML has been covered extensively on SO. The consensus is that it shouldn't be done.

Here are some related links worth reading:

One trick I have used in the past to parse HTML files is to convert them to XHTML, then treat the result as an XML file and query it with XPath. If this is an option look at:
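As a rough illustration of the XHTML-then-XPath idea (this is not the resource the answer pointed to), the sketch below assumes the conversion to well-formed XHTML has already happened and only shows the query step. It uses Python's standard-library `xml.etree.ElementTree`, which supports a limited subset of XPath. The sample document and the function name `hrefs` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed XHTML input.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><a href="http://example.com/one">One</a></p>
    <p><a id="anchor-only">No href</a></p>
    <div><a href="/two">Two</a></div>
  </body>
</html>"""

# XHTML elements live in this namespace, so the query needs a prefix for it.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def hrefs(xhtml_text):
    """Return the href of every <a> element that has one, in document order."""
    root = ET.fromstring(xhtml_text)
    # ".//x:a[@href]" selects every <a> descendant carrying an href attribute.
    return [a.get("href") for a in root.findall(".//x:a[@href]", NS)]


if __name__ == "__main__":
    for href in hrefs(XHTML):
        print(href)
```

This only works on input that is valid XML. Real-world HTML usually is not, which is why the conversion step matters.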

Abe Miessler 2010-10-07 17:59:52

Answer 3

+6 A:

Regex + HTML...

But BeautifulSoup is a handy library.

>>> from BeautifulSoup import BeautifulSoup
>>> html = '<a href="http://google.com"> Hello World! </a>'
>>> soup = BeautifulSoup(html)
>>> soup.a.string
u' Hello World! '

This, for instance, would print out links on this page:

import urllib2
from BeautifulSoup import BeautifulSoup

q = urllib2.urlopen('http://stackoverflow.com/questions/3884419/')
soup = BeautifulSoup(q.read())

for link in soup.findAll('a'):
    if link.has_key('href'):
        print str(link.string) + " -> " + link['href']
    elif link.has_key('id'):
        print "ID: " + link['id']
    else:
        print "???"

Output:

Stack Exchange -> http://stackexchange.com
log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
careers -> http://careers.stackoverflow.com
meta -> http://meta.stackoverflow.com
...
ID: flag-post-3884419
None -> /posts/3884419/revisions
...

Nick T 2010-10-07 18:01:45

If I had multiple links (<a href=""> blah blah </a>), this only seems to output the first link it comes across?

James Eggers 2010-10-07 18:18:59

There are other methods. `soup.findAll('a')` for instance. See the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html

Manoj Govindan 2010-10-07 18:31:34

I keep hearing about BeautifulSoup but I didn't realize it actually had such a nice API... there are so many tools out there, but a lot of them are just atrocious to use. This is nice :) I've been doing my parsing in C# though.

Mark 2010-10-07 20:53:28
