ansaurus

Question

RegExp - extract the value of a tag [python]

Answer 1

+7 A:

You don't.

Regular Expressions are not well suited to deal with the nested structure of HTML. Use an HTML parser instead.
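
A minimal sketch of what the parser route can look like using only the standard library's html.parser module (Python 3 assumed). The class name TextCollector and the helper strip_tags are illustrative names, and the sample input is the one used in Answer 3:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        self.chunks.append(data)


def strip_tags(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


print(strip_tags('<a bob>i <b>c</b></a>'))
```

The parser tracks tag and attribute boundaries itself, so no pattern has to guess where a tag ends.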

Jens 2010-06-23 10:44:36

Answer 2

+1 A:

Try this...

<a.*<b>(.*)</b>(.*)</a>

$1 and $2 should be what you want; in Python, the captured groups are available from the match object as match.group(1) and match.group(2).
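
In Python, captured groups are read from the match object rather than through $1 and $2. A minimal sketch, assuming the pattern above and a made-up sample input (the variable names are illustrative too):

```python
import re

# Sample input (an assumption; substitute the real markup)
html = '<a href="example"><b>LG</b> X110</a>'

# Same pattern as above: group 1 is the text inside <b>,
# group 2 is the text between </b> and </a>
pattern = re.compile(r'<a.*<b>(.*)</b>(.*)</a>')

match = pattern.search(html)
if match:
    bold_text = match.group(1)
    trailing_text = match.group(2)
    print(bold_text + trailing_text)
```

Because .* is greedy, the pattern can span too much when a line contains more than one <a> or <b> element, which is one reason the other answers recommend a parser.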

Adrian Regan 2010-06-23 10:48:29

Python, not PHP...

msanders 2010-06-23 11:14:48

Answer 3

+1 A:

Your question was very hard to understand, but from the given output example it looks like you want to strip everything within < and > from the input text. That can be done like so:

import re
input_text = '<a bob>i <b>c</b></a>'
# Remove every run that starts with < and ends at the next >
output_text = re.sub(r'<[^>]*>', '', input_text)
print(output_text)

Which gives you:

i c

If that is not what you want, please clarify.

Please note that the regular expression approach to parsing XML is very brittle. For instance, the above example would break on the input <a name="b>c">hey</a>, because > is a valid character in an attribute value (see the XML specification).
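
To make the failure concrete, the sketch below runs the same substitution on that input and compares the result with a real XML parser. Using xml.etree.ElementTree for the comparison is an assumption made for illustration; any conforming XML parser would do:

```python
import re
import xml.etree.ElementTree as ET

tricky = '<a name="b>c">hey</a>'

# The regex treats the first > as the end of the tag, but that >
# is inside the attribute value, so part of the tag leaks through
stripped = re.sub(r'<[^>]*>', '', tricky)
print(stripped)  # c">hey

# A parser understands attribute quoting and returns only the element text
root = ET.fromstring(tricky)
parsed = ''.join(root.itertext())
print(parsed)  # hey
```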

Deestan 2010-06-23 10:49:28

Answer 4

A:

+1 for Jens's answer. lxml is a good library for parsing this robustly. If you'd prefer something in the standard library, you can use xml.sax, xml.dom or xml.etree.ElementTree.
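
As a rough illustration of the standard-library route, here is a sketch using xml.etree.ElementTree. It assumes the input is well-formed XML (a bare attribute such as the one in <a bob> would be rejected), and the sample markup is made up for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample input
markup = '<a href="example"><b>LG</b> X110</a>'

root = ET.fromstring(markup)  # root is the <a> element
bold = root.find('b')         # first <b> child of <a>

inside_b = bold.text          # text inside <b>...</b>
after_b = bold.tail           # text between </b> and </a>
print(inside_b + after_b)
```

For real-world HTML that is not well-formed, a more forgiving parser such as lxml.html or BeautifulSoup is usually the better fit.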

Noufal Ibrahim 2010-06-23 10:54:13

Answer 5

+5 A:

Don't use regular expressions for parsing HTML. Use an HTML parser like BeautifulSoup. Just look how easy it is:

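from bs4 import BeautifulSoup  # BeautifulSoup 4 package
html = r'<a href="removed because it was too long"><b>LG</b> X110</a>'
soup = BeautifulSoup(html, 'html.parser')  # parse with the built-in HTML parser
# Collect every text node in the document and join them
print(''.join(soup.find_all(string=True)))
# LG X110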

DzinX 2010-06-23 10:59:43
