So you want to remove the <a> and <em> tags? That can be done like this:
>>> import re
>>> s = '<a href="http://testsite.com" class="className">link_text_part1 <em>another_text</em> link_text_part2</a>'
>>> re.sub(r"</?(a|em)\b.*?>", "", s)
'link_text_part1 another_text link_text_part2'
In English, this searches for:
- A < character
- optionally followed by a / (to get the closing tags)
- followed by 'a' or 'em'
- followed by a word boundary (so that longer tag names such as <abbr> or <embed> are left alone)
- followed by anything up to and including the first > character
and replaces each match with an empty string.
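If you need this for more than one set of tags, the same pattern can be built from a list of tag names. This is only a rough sketch: strip_tags and its parameters are names made up for illustration, and the re.IGNORECASE flag is an extra assumption so that <A> and <EM> are caught too.

```python
import re


def strip_tags(html, tags):
    """Remove opening and closing tags named in `tags`, keeping their inner text."""
    # re.escape guards against tag names that contain regex metacharacters.
    names = "|".join(re.escape(t) for t in tags)
    # \b stops "a" from also matching longer tag names such as <abbr>.
    pattern = re.compile(rf"</?(?:{names})\b.*?>", re.IGNORECASE)
    return pattern.sub("", html)


s = '<a href="http://testsite.com" class="className">link_text_part1 <em>another_text</em> link_text_part2</a>'
print(strip_tags(s, ["a", "em"]))
```

Note that . does not match newlines by default, so a tag that is split across several lines would be left in place.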
However, as Kos says, using regex to parse HTML is risky and fragile unless you know that the format of the HTML you are parsing will never change.
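For comparison, here is roughly what a parser-based version could look like using html.parser from the standard library. Treat it as a sketch under a few assumptions: TagStripper and strip_tags_parsed are made-up names, comments and doctype declarations are simply dropped, and closing tags that are kept are written back in lower case.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Rebuilds the input, leaving out the start and end tags named in `tags`."""

    def __init__(self, tags):
        # convert_charrefs=False so entities such as &amp; are passed through as-is.
        super().__init__(convert_charrefs=False)
        self.tags = {t.lower() for t in tags}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            # Copy the original text of the tag, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, not as start + end.
        if tag not in self.tags:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.tags:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags_parsed(html, tags):
    parser = TagStripper(tags)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


s = '<a href="http://testsite.com" class="className">link_text_part1 <em>another_text</em> link_text_part2</a>'
print(strip_tags_parsed(s, ["a", "em"]))
```

It is longer than the one-liner, but the parser decides where a tag starts and ends, so it should be less sensitive to changes in how the markup is written.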