Hello. Please help me turn a string like:

<a href="http://testsite.com" class="className">link_text_part1 <em>another_text</em> link_text_part2</a>

into a string like:

link_text_part1 another_text link_text_part2

using regular expressions in Python.

Note: the testsite.com URL changes.

+1  A: 

So you want to remove the <a> and <em> tags? That can be done like this:

>>> import re
>>> s = '<a href="http://testsite.com" class="className">link_text_part1 <em>another_text</em> link_text_part2</a>'
>>> re.sub("</?(a|em).*?>", "", s)
'link_text_part1 another_text link_text_part2'

In English, this searches for:

  • A < character
  • optionally followed by a / (to get the closing tags)
  • followed by 'a' or 'em'
  • followed by anything up to and including the first > character

and replaces them with empty strings.
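The steps above can be sketched with re.VERBOSE, which lets each part of the pattern carry its own comment. The names TAG_RE and STRICT_RE are made up for this illustration. STRICT_RE is a suggested tightening, not part of the original answer: the original pattern also matches any tag whose name merely starts with "a" or "em" (for example <abbr>), and a word boundary avoids that.

```python
import re

# The pattern from the answer, spelled out step by step.
TAG_RE = re.compile(
    r"""
    <          # a '<' character
    /?         # optionally a '/', to catch closing tags
    (a|em)     # the tag name starts with 'a' or 'em'
    .*?        # anything, as little as possible...
    >          # ...up to and including the first '>'
    """,
    re.VERBOSE,
)

# Hypothetical stricter variant: \b requires the tag name to end right
# after 'a' or 'em', so <abbr> or <embed> are left alone.
STRICT_RE = re.compile(r"</?(a|em)\b[^>]*>")

s = ('<a href="http://testsite.com" class="className">'
     'link_text_part1 <em>another_text</em> link_text_part2</a>')

result = TAG_RE.sub("", s)
print(result)                                # link_text_part1 another_text link_text_part2
print(TAG_RE.sub("", "<abbr>x</abbr>"))      # x  (the <abbr> tags were removed too)
print(STRICT_RE.sub("", "<abbr>x</abbr>"))   # <abbr>x</abbr>
```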

However, as Kos says, using regex to parse HTML is risky and fragile unless you know that the format of the HTML you are parsing will never change.

Dave Kirby
Thanks, but that didn't help with Scrapy.
Gennadich
+1  A: 
string = re.sub('<[^>]+>', '', string)
bluesmoon
Thank you, but that didn't help either.
Gennadich
you probably need a global flag.
bluesmoon
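On the flag suggestion: in Python, re.sub already replaces every match unless a count is passed, so no separate "global" flag exists or is needed. The short check below is only a sketch to show that behaviour; the variable names are made up for the illustration. If the substitution still appears to do nothing, the input is probably not the plain string it is assumed to be.

```python
import re

s = ('<a href="http://testsite.com" class="className">'
     'link_text_part1 <em>another_text</em> link_text_part2</a>')

# count defaults to 0, which means "replace every match".
all_removed = re.sub(r"<[^>]+>", "", s)

# count=1 stops after the first match, which is what a missing
# "global" flag looks like in some other languages.
first_only = re.sub(r"<[^>]+>", "", s, count=1)

# [^>] also matches newlines, so a tag split across lines is still removed.
multiline = re.sub(r"<[^>]+>", "", '<a\nhref="x">text</a>')

print(all_removed)   # link_text_part1 another_text link_text_part2
print(first_only)    # link_text_part1 <em>another_text</em> link_text_part2</a>
print(multiline)     # text
```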
A: 

Parsing HTML with regular expressions, even for simple cases, is generally strongly discouraged. You never know when you will hit some HTML that confuses your regex.

A light HTML parser is generally a more reliable and more elegant solution.
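As one possible illustration of the parser approach, here is a minimal sketch using html.parser from the Python 3 standard library. The names TextExtractor and strip_tags are hypothetical, chosen for this example, and this is not necessarily the parser the answer had in mind.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


s = ('<a href="http://testsite.com" class="className">'
     'link_text_part1 <em>another_text</em> link_text_part2</a>')
print(strip_tags(s))  # link_text_part1 another_text link_text_part2
```

Unlike the regex versions, this also decodes character references such as &amp; while it drops the tags.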

Kos
Thanks, I'll remember that.
Gennadich
A: 

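By the way, this helped:

# the same helper is also available as w3lib.html.remove_tags
from scrapy.utils.markup import remove_tags
...
# aaa is the HTML string; bbb is its text with the tags removed
bbb = remove_tags(aaa)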
Gennadich