Two bugs in your code. First, you're not matching (and specifically, capturing) what you think you're matching and capturing -- insert after your call to .search
:
>>> _.groups()
('',)
The unconstrained repetition of repetitions (star after a capturing group with nothing but stars) matches once too many -- with the empty string at the end of what you think you're matchin -- and that's what gets captured. Fix by changing at least one of the stars to a plus, e.g., by:
>>> pat_error = re.compile(r">(\s*\w+)*>")
>>> pat_error.search(text)
<_sre.SRE_Match object at 0x83ba0>
>>> _.groups()
(' the',)
Now THIS matches and captures sensibly. Second, youre not using raw string literal syntax where you should, so you don't have a backslash where you think you have one -- you have an escape sequence \1
which is the same as chr(1). Fix by using raw string literal syntax, i.e. after the above snippet
>>> pat_error.sub(r">\1", text)
'<hi type="italic"> the</hi>'
Alternatively you could double up all of your backslashes, to avoid them being taken as the start of escape sequences -- but, raw string literal syntax is much more readable.