Hi, I have a Python string as shown below:

<html><table border = 1><tr><td>JDICOM</td><td>Thu Sep 16 10:13:34 CDT 2010</td></tr></html>

From the above string I am interested in two values:

JDICOM
Thu Sep 16 10:13:34 CDT 2010

I tried find, findall, and split, but none of them helped because more than one regex seemed to be needed.

I am quite new to Python. If anyone knows how to do this, please help.

+4  A: 

Statutory Warning: don't use regular expressions to parse (X)HTML. You are much better off using a parser such as BeautifulSoup.

For example:

>>> from BeautifulSoup import BeautifulSoup
>>> html = """<html><table border = 1><tr><td>JDICOM</td><td>Thu Sep 16 10:13:34 CDT 2010</td></tr></html>"""
>>> soup = BeautifulSoup(html)
>>> for each in soup.findAll(name = 'td'):
 print each.contents[0]


JDICOM
Thu Sep 16 10:13:34 CDT 2010
>>> 
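
The session above is written for Python 2 and the old BeautifulSoup 3 import path. As a rough sketch of the same parser-based idea on Python 3 with only the standard library, something like the following should work. The class name CellTextParser and its cells attribute are made up for illustration; they are not part of any library.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Hypothetical helper: collects the text inside each <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []       # text of every completed <td>
        self._buffer = None   # list of text chunks while inside a <td>, else None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so "TD" and "td" are treated alike.
        if tag == "td":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears while a <td> is open.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            self.cells.append("".join(self._buffer).strip())
            self._buffer = None


html = ("<html><table border = 1><tr><td>JDICOM</td>"
        "<td>Thu Sep 16 10:13:34 CDT 2010</td></tr></html>")

parser = CellTextParser()
parser.feed(html)
parser.close()

for cell in parser.cells:
    print(cell)
```

This assumes the cells are not nested inside one another; for anything more complicated a full parser such as BeautifulSoup is probably still the easier route.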

That said, here is a regular expression to do the same thing. Warning: this will stop working if the markup is irregular.

>>> import re
>>> pattern = re.compile('<td>(.*?)</td>', re.I | re.S)
>>> for each in pattern.findall(html):
 print each


JDICOM
Thu Sep 16 10:13:34 CDT 2010
>>> 
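
To make the warning concrete, here is a small Python 3 sketch of how the pattern behaves once the markup stops being regular. The irregular sample string and the more tolerant second pattern are assumptions added for illustration, not something from the question.

```python
import re

# Same pattern as in the session above: only matches a bare <td> tag.
strict = re.compile(r"<td>(.*?)</td>", re.I | re.S)

# A slightly more tolerant variant (an assumption, still fragile):
# allows attributes or whitespace inside the opening tag.
tolerant = re.compile(r"<td\b[^>]*>(.*?)</td>", re.I | re.S)

regular = "<tr><td>JDICOM</td><td>Thu Sep 16 10:13:34 CDT 2010</td></tr>"
irregular = ('<tr><td class="name">JDICOM</td>'
             "<td >Thu Sep 16 10:13:34 CDT 2010</td></tr>")

strict_regular = strict.findall(regular)        # both cells are found
strict_irregular = strict.findall(irregular)    # nothing matches
tolerant_irregular = tolerant.findall(irregular)

print(strict_regular)
print(strict_irregular)
print(tolerant_irregular)
```

The tolerant pattern copes with attributes, but it would still trip over nested tables, comments, or unclosed tags, which is why a parser is the safer choice.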
Manoj Govindan
Thank you very much
u3050