ansaurus

Question

Answer 1

+4 A:

Statutory Warning: don't use regular expressions to parse (X)HTML. You are much better off using a parser such as BeautifulSoup.

For example:

>>> from BeautifulSoup import BeautifulSoup
>>> html = """<html><table border = 1><tr><td>JDICOM</td><td>Thu Sep 16 10:13:34 CDT 2010</td></tr></html>"""
>>> soup = BeautifulSoup(html)
>>> for each in soup.findAll(name = 'td'):
...     print each.contents[0]
...
JDICOM
Thu Sep 16 10:13:34 CDT 2010
>>>
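If BeautifulSoup is not installed, the same parser-based idea can be sketched with the standard library alone. This is only a minimal sketch, assuming Python 3 and its html.parser module; the names TdTextCollector and cell_texts are made up for illustration and are not part of BeautifulSoup or of the session above.

```python
from html.parser import HTMLParser


class TdTextCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []        # text of every finished <td>
        self._in_td = False    # True while the parser is inside a <td>
        self._parts = []       # text fragments of the current <td>

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes do not affect the match.
        if tag == "td":
            self._in_td = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self._in_td = False
            self.cells.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._in_td:
            self._parts.append(data)


def cell_texts(markup):
    """Return the text of every <td> cell in markup (illustrative helper)."""
    parser = TdTextCollector()
    parser.feed(markup)
    parser.close()
    return parser.cells


html = """<html><table border = 1><tr><td>JDICOM</td><td>Thu Sep 16 10:13:34 CDT 2010</td></tr></html>"""
for text in cell_texts(html):
    print(text)
```

Because the parser reports tag names separately from their attributes, a cell written as <td class="x"> is picked up the same way as a bare <td>.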

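That said, here is a regular expression to do the same thing. Warning: it only matches the exact <td>...</td> form, so it will stop working as soon as the markup is irregular, for example when a td tag carries attributes or tables are nested.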

>>> import re
>>> pattern = re.compile('<td>(.*?)</td>', re.I | re.S)
>>> for each in pattern.findall(html):
...     print each
...
JDICOM
Thu Sep 16 10:13:34 CDT 2010
>>>
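To make the warning concrete, here is a small sketch (Python 3 syntax) that runs the same pattern over the regular markup and over a slightly irregular variant. The irregular string is an invented example, not data from the question.

```python
import re

# Same pattern as in the session above.
pattern = re.compile('<td>(.*?)</td>', re.I | re.S)

regular = '<tr><td>JDICOM</td><td>Thu Sep 16 10:13:34 CDT 2010</td></tr>'

# Invented variant: one cell has an attribute, the other a stray space.
irregular = '<tr><td class="name">JDICOM</td><td >Thu Sep 16 10:13:34 CDT 2010</td></tr>'

regular_matches = pattern.findall(regular)
irregular_matches = pattern.findall(irregular)

print(regular_matches)    # both cells are found
print(irregular_matches)  # nothing is found: neither opening tag is exactly <td>
```

A parser treats both variants the same, which is the reason for the warning at the top of this answer.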

Manoj Govindan 2010-09-16 10:23:43

Thank you very much

u3050 2010-09-16 10:56:43

Python String split with multiple regex
