ansaurus

Question

Need help with regular expressions in Python

Answer 1

+1 A:

So you want to remove the <a> and <em> tags? That can be done like this:

>>> import re
>>> s = '<a href="http://testsite.com" class="className">link_text_part1 <em>another_text</em> link_text_part2</a>'
>>> re.sub("</?(a|em).*?>", "", s)
'link_text_part1 another_text link_text_part2'

In English, this searches for:

A < character
optionally followed by a / (to get the closing tags)
followed by 'a' or 'em'
followed by anything up to and including the first > character

and replaces them with empty strings.
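As a rough sketch, the breakdown above can be written out with `re.VERBOSE` so that each part of the pattern carries its own comment. The names `TAG_PATTERN`, `STRICT_PATTERN` and `abbr` are made up for illustration. The second pattern is an assumed refinement, not part of the original expression: it adds `\b` after the tag name so that tags which merely start with `a` or `em`, such as `<abbr>`, are left alone.

```python
import re

# The same pattern as above, spelled out piece by piece.
TAG_PATTERN = re.compile(r"""
    <        # a literal < character
    /?       # optionally a /, so closing tags match too
    (a|em)   # the tag name: a or em
    .*?      # anything, as little as possible ...
    >        # ... up to and including the first >
""", re.VERBOSE)

# Assumed refinement: require a word boundary after the tag name.
STRICT_PATTERN = re.compile(r"</?(a|em)\b.*?>")

s = ('<a href="http://testsite.com" class="className">'
     'link_text_part1 <em>another_text</em> link_text_part2</a>')

cleaned = TAG_PATTERN.sub("", s)
print(cleaned)

# The original pattern also strips tags that only begin with "a".
abbr = "<abbr>HTML</abbr>"
print(TAG_PATTERN.sub("", abbr))     # the <abbr> tags are removed as well
print(STRICT_PATTERN.sub("", abbr))  # the <abbr> tags are kept
```

Whether the stricter form is wanted depends on which tags can actually appear in the input.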

However, as Kos says, using regex to parse HTML is risky and fragile unless you know that the format of the HTML you are parsing will never change.

Dave Kirby 2010-07-23 10:37:25

Thanks, but that didn't help with Scrapy.

Gennadich 2010-07-23 11:17:59

Answer 2

+1 A:

import re
string = re.sub('<[^>]+>', '', string)

bluesmoon 2010-07-23 10:43:45

Thank you, but that didn't help either.

Gennadich 2010-07-23 11:35:18

You probably need a global flag.

bluesmoon 2010-07-23 22:21:45
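
On the global-flag point: in Python's standard `re` module, `re.sub` replaces every non-overlapping match unless a `count` argument limits it, so a separate flag should not be needed here. A quick check, using a made-up `sample` string:

```python
import re

sample = "<b>one</b> and <i>two</i>"  # made-up input for illustration

# Default behaviour: every match of the pattern is replaced.
all_removed = re.sub("<[^>]+>", "", sample)

# Passing count=1 limits the substitution to the first match only.
first_only = re.sub("<[^>]+>", "", sample, count=1)

print(all_removed)
print(first_only)
```

If the tags still survive in practice, the cause is more likely the input itself (for example, the value not being a plain string) than a missing flag, though that is only a guess without seeing the Scrapy code.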

Answer 3

A:

Parsing HTML with regular expressions, even for simple cases, is strongly discouraged. You never know when you will hit some HTML that confuses your regex.

A lightweight HTML parser is generally a more reliable and more elegant solution.
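
As one possible illustration of the parser approach, here is a minimal sketch built on `html.parser.HTMLParser` from the Python 3 standard library. The class name `TextExtractor` and the function `strip_tags` are hypothetical helpers invented for this example, and the stdlib parser is only one of several parsers that could fill this role.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; character references such as
        # &amp; arrive already decoded (convert_charrefs defaults to True).
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with the tags dropped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


s = ('<a href="http://testsite.com" class="className">'
     'link_text_part1 <em>another_text</em> link_text_part2</a>')
print(strip_tags(s))
```

Unlike the regex versions, this drops every tag rather than only `<a>` and `<em>`; restricting it to specific tags would need extra handling in `handle_starttag` and `handle_endtag`.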

Kos 2010-07-23 10:43:53

Thanks, I'll remember that.

Gennadich 2010-07-23 11:18:50

Answer 4

A:

BTW, this helped:

from scrapy.utils.markup import remove_tags
...
bbb = remove_tags(aaa)

Gennadich 2010-07-25 14:35:08
