ansaurus

Question

Python regular expressions: search and replace weirdness

Answer 1

+1 A:

Use (?<=...) and (?=...) to match parts of the string but not replace them:

re.sub("(?<=s )(.*?)(?= s)", "no", "this is a string")

EDIT: This returns "this no string", so not quite what you want... :-(

For your updated question, try this:

re.sub(r"(?<=href=['\"])((?!http).*?)(?=['\"].*?>)", 'test', string)

Isn't it enough to check href=" before a link?

eumiro 2010-09-22 13:34:49

You are absolutely right.

Bolhoed 2010-09-22 14:02:13

Answer 2

+2 A:

Your regex matches everything from the first s to the last s, so if you replace the match with "no", you get "thinotring".

The parentheses don't limit the match; they capture the text matched by whatever is inside them in a numbered group that you can refer to through a backreference. In your example, backreference number 1 would contain "is a". You can refer to a backreference later in the same regex using a backslash and the number of the group: \1.
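To make the difference between the whole match and the captured group concrete, here is a small sketch. The pattern "s (.*?) s" is an assumption on my part, since the regex from the question is not quoted in this thread; it is simply one that reproduces the "thinotring" result.

```python
import re

text = "this is a string"

# Assumed pattern (the question's original regex is not shown here).
match = re.search(r"s (.*?) s", text)
whole_match = match.group(0)   # everything the regex consumed
group_one = match.group(1)     # only the part inside the parentheses

# re.sub replaces the whole match, not just the captured group.
replaced = re.sub(r"s (.*?) s", "no", text)

# \1 inside a pattern refers back to whatever group 1 matched,
# so (\w)\1 finds a doubled character.
doubled = re.search(r"(\w)\1", "string letter").group(0)

print(whole_match)  # s is a s
print(group_one)    # is a
print(replaced)     # thinotring
print(doubled)      # tt
```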

What you probably want is lookaround:

re.sub(r"(?<=s ).*?(?= s)", "no", "this is a string")

(?<=s ) means: Assert that it is possible to match s followed by a space immediately before the current position in the string, but don't make it part of the match.

Same for (?= s), but it asserts that the string continues with a space followed by s after the current position.
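A quick way to see that the lookaround text is checked but not consumed is to look at the span of the match. This is only a sketch on the same sample string; the variable names are mine.

```python
import re

text = "this is a string"

match = re.search(r"(?<=s ).*?(?= s)", text)
matched_text = match.group(0)
matched_span = match.span()

# Only the span (5, 9) is replaced; the "s " before it and the " s"
# after it are asserted but stay in the string.
replaced = re.sub(r"(?<=s ).*?(?= s)", "no", text)

print(matched_text)  # is a
print(matched_span)  # (5, 9)
print(replaced)      # this no string
```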

Be advised that lookbehind in Python is limited to strings of fixed length. So if that is a problem, you can sort of work around this using...backreferences!

re.sub(r"(s ).*?( s)", r"\1no\2", "this is a string")
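For what it's worth, here is a sketch of what the fixed-length limitation looks like in practice with the standard re module, and how the capturing-group version copes with a variable number of spaces. The input string with three spaces is made up for illustration.

```python
import re

# A lookbehind containing "+" has no fixed width, so re refuses to compile it.
try:
    re.compile(r"(?<=s +).*?(?= s)")
    error_message = None
except re.error as exc:
    error_message = str(exc)

# The capturing-group version has no such restriction: group 1 may be
# any length, and \1 puts it back into the result unchanged.
replaced = re.sub(r"(s +).*?( s)", r"\1no\2", "this   is a string")

print(error_message is not None)  # True
print(replaced)                   # this   no string
```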

OK, this is a contrived example, but it shows what you can do. From your edit, it's becoming apparent that you're trying to parse HTML with regex. Now that is not such a good idea. Search SO for "regex html" and you'll see why.

If you still want to do it:

re.sub(r"(<a.*?href=['\"])((?!http).*?['\"].*?>)", r'\1http://\2', string)

might work. But this is extremely brittle.

Tim Pietzcker 2010-09-22 13:35:22

Sadly that won't work, see my new example in the edited question.

Bolhoed 2010-09-22 13:39:08

I came to the same solution as you and eumiro did. Thanks all!

Bolhoed 2010-09-22 14:05:07

Answer 3

A:

Ok, look-around was possible, just needed a small rewrite. This works:

import re

def absolutize(string, prefix):
    return re.sub(r"(?<=href=['\"])((?!http).*?)(?=['\"])", prefix+r'\1', string)
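A quick check of the function, repeated here so the snippet runs on its own. The sample HTML and the prefix are invented for illustration. Note that the (?!http) test only looks at the first four characters, so an href such as "#top" gets the prefix as well.

```python
import re

def absolutize(string, prefix):
    # Prefix every href value that does not start with "http".
    return re.sub(r"(?<=href=['\"])((?!http).*?)(?=['\"])", prefix + r"\1", string)

# Hypothetical input: one relative link, one absolute link.
html = '<a href="page.html">rel</a> <a href="http://example.com/">abs</a>'
result = absolutize(html, "http://example.org/")
print(result)

# Fragment-only links are rewritten too, which may or may not be wanted.
anchor_result = absolutize('<a href="#top">up</a>', "http://example.org/")
print(anchor_result)
```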

Still, stupid Python regex system... :(

Bolhoed 2010-09-22 13:44:32

Answer 4

A:

Your expression, while nasty looking, does work. The problem is that you are not capturing the result of re.sub, which returns the replaced string and doesn't modify the string passed as a parameter.

import re

new_string = re.sub(r"<a.*?href=['\"]((?!http).*?)['\"].*?>", 'test', string)
print(new_string)

Check it here on IDEone.com: http://ideone.com/ufaTw
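If it helps, the point about the return value can be seen on a plain string as well. This is a minimal sketch with a made-up input, not the HTML from the question.

```python
import re

original = "this is a string"

# Calling re.sub without keeping the result leaves `original` untouched,
# because Python strings are immutable.
re.sub(r"is a", "no", original)
unchanged = original

# Binding the return value is what makes the replacement visible.
replaced = re.sub(r"is a", "no", original)

print(unchanged)  # this is a string
print(replaced)   # this no string
```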

BTW, you're probably better off using Beautiful Soup or similar to systematically search and replace HTML. Using regex for this is a bad idea.
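As a rough illustration of the parser route, here is a sketch that uses the standard library's html.parser and urllib.parse.urljoin instead of Beautiful Soup, purely so it runs without installing anything. The class name LinkCollector, the base URL, and the sample HTML are all made up. It only collects the resolved links; it does not rewrite the document.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value:
                # urljoin leaves absolute URLs alone and resolves relative ones
                self.links.append(urljoin(self.base_url, value))

collector = LinkCollector("http://example.org/docs/")
collector.feed('<a class="x" href="page.html">rel</a> '
               "<a href='http://example.com/'>abs</a>")
links = collector.links
print(links)
```

Attribute order and quote style stop mattering here, which is the main thing the regex versions struggle with.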

Martin Thomas 2010-09-22 14:00:36
