ansaurus

Question

Python Regexp problem

Answer 1

+1 A:

This

import re

htmlbody = "<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td><td align=center width=80>"

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.search(htmlbody).group(1)
print 'Value is', value

prints out

Value is random Value

Is this what you want?

clorz 2009-04-17 22:56:45

Not completely. It works when the <tr>... string is assigned to htmlbody. However, in my script htmlbody is a whole HTML page, and in that case it doesn't seem to work. I forgot to mention that the page contains multiple instances of this line...

MarcoW 2009-04-17 23:01:25

Do you mean that the <tr> may be on the previous line? Is it possible to exclude it from the regexp? You could try reading all the lines, gluing them together without line breaks, and searching for all occurrences of the regexp. Or you could try making the regexp more general.

clorz 2009-04-17 23:09:04
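
A rough sketch of the two ideas in the comment above (gluing the lines together, or making the regexp more general). The sample page and all variable names are made up for illustration, and the code is written for Python 3. It assumes the only difference from the one-line example is whitespace between tags.

```python
import re

# Hypothetical page: the first row is split across lines, the second is not.
page = """<table>
<tr>
<td width=60 bgcolor='#ffffcc'><b>first value</b></td>
<td align=center width=80>10</td></tr>
<tr><td width=60 bgcolor='#ffffcc'><b>second value</b></td><td align=center width=80>20</td></tr>
</table>"""

# Option 1: glue the lines together, then reuse the original strict pattern.
glued = "".join(line.strip() for line in page.splitlines())
strict = re.compile(
    "<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>"
)
values_from_glued = strict.findall(glued)

# Option 2: leave the page untouched and allow optional whitespace between tags.
relaxed = re.compile(
    r"<tr>\s*<td width=60 bgcolor='#ffffcc'>\s*<b>([^<]*)</b>\s*</td>"
    r"\s*<td align=center width=80>"
)
values_from_relaxed = relaxed.findall(page)

print(values_from_glued)
print(values_from_relaxed)
```

Note that gluing lines with no separator also joins words that were split across lines, so Option 2 is probably the safer of the two if the cell text itself can wrap.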

Answer 2

+4 A:

There is no surefire way to do this with a regex. See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why. What you need is an HTML parser like HTMLParser:

#!/usr/bin/python
# Written for Python 2; in Python 3 the class is imported from html.parser
# and print is a function.

from HTMLParser import HTMLParser

class FindTDs(HTMLParser):
    """Print the text found inside <td> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.level = 0  # nesting depth of currently open <td> tags

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.level += 1

    def handle_endtag(self, tag):
        if tag == 'td':
            self.level -= 1

    def handle_data(self, data):
        # Only text inside at least one open <td> is printed.
        if self.level > 0:
            print data

find = FindTDs()

# Build a small 3x5 test table.
html = "<table>\n"
for i in range(3):
    html += "\t<tr>"
    for j in range(5):
        html += "<td>%s.%s</td>" % (i, j)
    html += "</tr>\n"
html += "</table>"

find.feed(html)

Chas. Owens 2009-04-17 23:22:47
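
To connect the parser approach to the markup in the question, here is a minimal sketch that collects only the bold text inside cells with bgcolor='#ffffcc'. It is written for Python 3 (html.parser). The class name LabelCellParser, its attributes, and the sample page are hypothetical, and it assumes the target cells are not nested inside one another.

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class


class LabelCellParser(HTMLParser):
    """Collect the text of <b> elements inside <td bgcolor='#ffffcc'> cells."""

    def __init__(self):
        super().__init__()
        self.in_label_cell = False  # inside a <td> with the target bgcolor
        self.in_bold = False        # inside a <b> within such a <td>
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # attrs is a list of (name, value) pairs; names arrive lowercased.
            self.in_label_cell = dict(attrs).get("bgcolor") == "#ffffcc"
        elif tag == "b" and self.in_label_cell:
            self.in_bold = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False
        elif tag == "td":
            self.in_label_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current value.
        if self.in_bold:
            self.values[-1] += data


# Hypothetical two-row page in the same shape as the question's snippet.
page = (
    "<table>"
    "<tr><td width=60 bgcolor='#ffffcc'><b>first value</b></td>"
    "<td align=center width=80>10</td></tr>"
    "<tr><td width=60 bgcolor='#ffffcc'><b>second value</b></td>"
    "<td align=center width=80>20</td></tr>"
    "</table>"
)

parser = LabelCellParser()
parser.feed(page)
parser.close()
print(parser.values)
```

Unlike the regex, this does not care about attribute order, extra attributes, or line breaks between the tags.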

Answer 3

+1 A:

It sounds like you may want to use findall rather than search:

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.findall(htmlbody)
print 'Found %i match(es)' % len(value)

I have to caution you, though, that regular expressions are notoriously poor at handling HTML. You're better off using a proper parser, such as the HTMLParser module built into Python.

Ben Blank 2009-04-17 23:26:50
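
For comparison, a small sketch of how search and findall differ on a page with several matching rows. The sample htmlbody is made up, and the code is written for Python 3. With a single capture group, findall returns a list of the captured strings rather than match objects.

```python
import re

# Hypothetical page with two rows in the same shape as the question's snippet.
htmlbody = (
    "<table>"
    "<tr><td width=60 bgcolor='#ffffcc'><b>first value</b></td>"
    "<td align=center width=80>10</td></tr>"
    "<tr><td width=60 bgcolor='#ffffcc'><b>second value</b></td>"
    "<td align=center width=80>20</td></tr>"
    "</table>"
)

reg = re.compile(
    "<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>"
)

# search stops at the first match and returns None when nothing matches,
# so guard before calling .group().
first = reg.search(htmlbody)
first_value = first.group(1) if first else None

# findall returns every captured group, or an empty list when nothing matches.
all_values = reg.findall(htmlbody)
no_values = reg.findall("<p>no table here</p>")

print(first_value)
print(all_values)
print(no_values)
```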
