Hey,

I am processing my website and want to change some things on the pages.

I want to replace the following string:

in the
<SPAN class="Bold">
More...
</SPAN>
column to your right.

Sometimes it does not have the <span> tags:

in the
More...
column to your right.

I would like to replace this with "below". I tried doing this with a simple replace() in Python, but because the text sometimes lacks the <span> tags and is spread over multiple lines, it does not work. My only thought is to use regular expressions, but I am not up to speed with regexes. Could anyone lend a hand?

Thanks

Eef

+2  A: 

Assuming you have the HTML text in the string foo, the code to do this in Python would look like this:

import re
# re.DOTALL makes . match every character, including newlines
regexp = re.compile(r'in the.*?More\.\.\..*?column to your right\.', re.DOTALL)
# sub() returns a new string rather than modifying foo, so assign the result
foo = regexp.sub('below', foo)
Jared Siirila
After running this over more than two HTML pages I get a stack overflow error. I think it may be the re.DOTALL used to handle the hard returns. :-/
Eef
@Eef: As far as I can tell, Jared's solution should work. I can't reproduce your stack overflow message. DOTALL is extremely unlikely to be causing a stack overflow. It merely does what Jared said. It is necessary to match anything (including newlines) between 'in the' and 'More'. Please show us the code that implements Jared's solution, plus the full traceback and error message.
John Machin
This solution worked. I was having problems with my environment, which have now been resolved, and I no longer get the error. Cheers
Eef
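To make the DOTALL point from the comments concrete, here is a small sketch that runs the same pattern with and without the flag. The two sample strings are hypothetical, modelled on the two variants shown in the question, and the variable names are illustrative only.

```python
import re

# Hypothetical sample inputs modelled on the two variants in the question.
with_span = 'in the\n<SPAN class="Bold">\nMore...\n</SPAN>\ncolumn to your right.'
without_span = 'in the\nMore...\ncolumn to your right.'

pattern_text = r'in the.*?More\.\.\..*?column to your right\.'
dotall = re.compile(pattern_text, re.DOTALL)
no_dotall = re.compile(pattern_text)

# With DOTALL, '.' also matches '\n', so both variants collapse to 'below'.
results = [dotall.sub('below', s) for s in (with_span, without_span)]

# Without DOTALL, '.' stops at the first newline, so nothing is replaced.
unchanged = no_dotall.sub('below', with_span)

print(results)
print(unchanged == with_span)
```

The non-greedy .*? keeps each match as short as possible, which matters if a page contains the phrase more than once.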
A: 

Try this:

import re
pattern = re.compile(r'(?:<SPAN class="Bold">\s*)?More\.\.\.(?:\s*</SPAN>)?')
# text holds the HTML; the name str is avoided because it shadows the built-in type
text = pattern.sub('below', text)

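The (?:…) syntax is a non-capturing group: it groups the sub-pattern so that the trailing ? makes the whole tag optional, but the matched text is not stored and cannot be referenced with a backreference.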

Gumbo
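As a rough illustration of the non-capturing groups described above, the sketch below compares the pattern with a capturing equivalent. The sample string is hypothetical, modelled on the question, and the capturing variant exists only for comparison.

```python
import re

# (?:...) groups the optional tag so '?' applies to the whole tag,
# without creating a numbered group.
non_capturing = re.compile(r'(?:<SPAN class="Bold">\s*)?More\.\.\.(?:\s*</SPAN>)?')
# Same pattern with ordinary (capturing) parentheses, for comparison only.
capturing = re.compile(r'(<SPAN class="Bold">\s*)?More\.\.\.(\s*</SPAN>)?')

# Hypothetical sample input modelled on the question.
sample = 'in the\n<SPAN class="Bold">\nMore...\n</SPAN>\ncolumn to your right.'

replaced = non_capturing.sub('below', sample)
# .groups reports how many capturing groups each compiled pattern has.
group_counts = (non_capturing.groups, capturing.groups)

print(repr(replaced))
print(group_counts)
```

Note that this pattern replaces only the More... part and its optional tags. The surrounding "in the" and "column to your right." text stays in place, so it would need to be combined with the DOTALL approach if the whole phrase should become "below".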