I'm trying to use Python's re.sub function to replace some text, but the result is not what I expect:

>>> import re
>>> text = "<hi type=\"italic\"> the></hi>"
>>> pat_error = re.compile(">(\s*\w*)*>")
>>> pat_error.search(text)
<_sre.SRE_Match object at 0xb7a3fea0>
>>> re.sub(pat_error, ">\1", text)
'<hi type="italic">\x01</hi>'

Afterwards the value of text should be

"<hi type="italic"> the</hi>"
A: 
>>> text.replace("><", "<")
'<hi type="italic"> the</hi>'
ghostdog74
This won't work because in other cases the value of text might be "<tag>stuff</tag><tag>blah</tag>", where the "><" between adjacent tags is legitimate and must not be replaced.
Daniel
+2  A: 

There are two bugs in your code. First, you're not matching (and, specifically, not capturing) what you think you are. To see this, add the following right after your call to .search:

>>> _.groups()
('',)

The unconstrained repetition of repetitions (a star after a capturing group that contains nothing but starred items) matches one time too many: the last iteration matches the empty string at the end of what you think you're matching, and that empty string is what gets captured. Fix this by changing at least one of the stars to a plus, e.g.:

>>> pat_error = re.compile(r">(\s*\w+)*>")
>>> pat_error.search(text)
<_sre.SRE_Match object at 0x83ba0>
>>> _.groups()
(' the',)

Now THIS matches and captures sensibly. Second, you're not using raw string literal syntax where you should, so you don't have a backslash where you think you have one -- in a plain string literal, \1 is an escape sequence that is the same as chr(1), which is why your output contained '\x01'. Fix this by using raw string literal syntax, i.e., after the above snippet:

>>> pat_error.sub(r">\1", text)
'<hi type="italic"> the</hi>'
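
To make the difference between the two replacement strings concrete, here is a minimal sketch that compares them directly. It assumes Python 3, and the variable names plain and raw are only illustrative:

```python
# Assumption: Python 3. The names `plain` and `raw` are illustrative only.

# In a plain string literal, \1 is an octal escape: the single character chr(1).
plain = ">\1"

# In a raw string literal, the backslash is kept, so re sees the backreference \1.
raw = r">\1"

print(repr(plain), len(plain))  # '>\x01' 2
print(repr(raw), len(raw))      # '>\\1' 3
```

Only the raw version still contains a backslash by the time re.sub sees it, so only that one can refer to group 1.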

Alternatively, you could double up all of your backslashes to avoid them being taken as the start of escape sequences -- but raw string literal syntax is much more readable.
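
As a rough illustration of that equivalence, the following sketch (assuming Python 3; the variable names are made up for the example) applies both fixes as a standalone script and checks that the doubled-backslash spelling behaves the same as the raw one:

```python
import re

# Assumption: Python 3. All variable names here are illustrative.
text = '<hi type="italic"> the></hi>'

# Raw-string spelling of the corrected pattern and replacement.
pat_raw = re.compile(r">(\s*\w+)*>")
result_raw = pat_raw.sub(r">\1", text)

# Same pattern and replacement with every backslash doubled instead.
pat_doubled = re.compile(">(\\s*\\w+)*>")
result_doubled = pat_doubled.sub(">\\1", text)

print(result_raw)                    # <hi type="italic"> the</hi>
print(result_raw == result_doubled)  # True
```

Both spellings produce identical strings before re ever sees them, so the choice is purely about readability.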

Alex Martelli