I'm trying to match a line from a web page with a regular expression. The line is as follows:

<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td><td align=center width=80>

This is what I tried, but it doesn't seem to work. Can anyone help me out? 'htmlbody' contains the HTML page, and no, I did not forget to import 're'.

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.search(htmlbody)
print 'Value is', value
+1  A: 

This version, which calls .group(1) on the match to get the captured text,

import re

htmlbody = "<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td><td align=center width=80>"

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.search(htmlbody).group(1)
print 'Value is', value

prints out

Value is random Value

Is this what you want?

clorz
Not completely. It works when the <tr>... string is assigned to htmlbody. However, in my script htmlbody is a whole HTML page, and in that case it doesn't seem to work. I forgot to mention: the page contains multiple instances of this line...
MarcoW
Do you mean that the <tr> may be on the previous line? Is it possible to exclude it from the regexp? You can try reading all the lines, joining them together without line breaks, and searching for all occurrences of the regexp. Or you can try to make the regexp more general.
clorz
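One way to "make the regexp more general", as suggested in the comment above, is to allow optional whitespace between the tags so that line breaks and indentation no longer matter. The following is only a sketch written for Python 3 (the thread itself uses Python 2); the page fragment and the "first Value"/"second Value" texts are made up for illustration.

```python
import re

# Hypothetical page fragment: the row appears twice, once split across lines.
htmlbody = """<table>
<tr>
  <td width=60 bgcolor='#ffffcc'><b>first Value</b></td>
  <td align=center width=80>1</td>
</tr>
<tr><td width=60 bgcolor='#ffffcc'><b>second Value</b></td><td align=center width=80>2</td></tr>
</table>"""

# \s* between tags tolerates line breaks and indentation;
# [^<]* captures the text up to the next tag.
reg = re.compile(
    r"<tr>\s*<td width=60 bgcolor='#ffffcc'>\s*<b>([^<]*)</b>\s*</td>"
    r"\s*<td align=center width=80>"
)

values = reg.findall(htmlbody)
print(values)
```

This still depends on the attributes appearing in exactly this order and spelling, so it remains fragile against any change in the page's markup.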
+4  A: 

There is no surefire way to do this with a regex. See "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for the reasons. What you need is an HTML parser such as HTMLParser:

#!/usr/bin/python

from HTMLParser import HTMLParser

class FindTDs(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.level = 0  # number of <td> elements we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.level = self.level + 1

    def handle_endtag(self, tag):
        if tag == 'td':
            self.level = self.level - 1

    def handle_data(self, data):
        # only print text that appears inside a <td>
        if self.level > 0:
            print data

find = FindTDs()

# build a small 3x5 test table
html = "<table>\n"
for i in range(3):
    html += "\t<tr>"
    for j in range(5):
        html += "<td>%s.%s</td>" % (i, j)
    html += "</tr>\n"
html += "</table>"

find.feed(html)
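
The example above prints the text of every cell. To pull out only the values the question asks about, the same approach can be narrowed to the bold text inside cells with bgcolor='#ffffcc'. This is a rough sketch for Python 3, where the module is named html.parser; the class name HighlightedCells, the sample page and the "other Value" text are assumptions made for illustration, not part of the original answer.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser


class HighlightedCells(HTMLParser):
    """Collect the text of <b> elements inside <td bgcolor='#ffffcc'> cells."""

    def __init__(self):
        super().__init__()
        self.in_cell = False  # inside a <td> with the target bgcolor
        self.in_bold = False  # inside a <b> within such a <td>
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_cell = dict(attrs).get('bgcolor') == '#ffffcc'
        elif tag == 'b' and self.in_cell:
            self.in_bold = True
            self.values.append('')  # start a new value

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        elif tag == 'b':
            self.in_bold = False

    def handle_data(self, data):
        # text may arrive in several chunks, so append to the current value
        if self.in_bold:
            self.values[-1] += data


# Hypothetical page containing the row from the question twice.
page = ("<table>"
        "<tr><td width=60 bgcolor='#ffffcc'><b>random Value</b></td>"
        "<td align=center width=80>1</td></tr>"
        "<tr><td width=60 bgcolor='#ffffcc'><b>other Value</b></td>"
        "<td align=center width=80>2</td></tr>"
        "</table>")

parser = HighlightedCells()
parser.feed(page)
parser.close()
print(parser.values)
```

Unlike the regex, this does not care about attribute order, extra attributes, or whitespace between tags.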
Chas. Owens
+1  A: 

It sounds like you may want to use findall rather than search:

reg = re.compile("<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")
value = reg.findall(htmlbody)
print 'Found %i match(es)' % len(value)

I have to caution you, though, that regular expressions are notoriously poor at handling HTML. You're better off using a proper parser, such as the HTMLParser module built into Python.
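
To illustrate the difference between search and findall: search returns a match object for the first occurrence only (or None), while findall returns a list with the captured group from every occurrence. The sketch below is written for Python 3, and the three-row page it builds is invented for the example.

```python
import re

# Hypothetical page with the row from the question repeated three times.
row = ("<tr><td width=60 bgcolor='#ffffcc'><b>value %d</b></td>"
       "<td align=center width=80>x</td></tr>\n")
htmlbody = ("<html><body><table>\n"
            + "".join(row % i for i in range(3))
            + "</table></body></html>")

reg = re.compile(r"<tr><td width=60 bgcolor='#ffffcc'><b>([^<]*)</b></td><td align=center width=80>")

first = reg.search(htmlbody)        # match object for the first row, or None
all_values = reg.findall(htmlbody)  # captured group from every matching row

print('First:', first.group(1) if first else None)
print('Found %i match(es): %r' % (len(all_values), all_values))
```

Checking the result of search for None before calling group avoids an AttributeError when the page contains no matching row.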

Ben Blank