ansaurus

Question

If you are really good at Python and regex, please help fix this function.

Answer 1

+1 A:

From your description, it sounds like A) the input variable isn't being split on spaces correctly and B) it isn't being escaped properly.

I think it may be a case of under-parenthesized expressions:

Try this:

return mark_safe(re.compile(r"\b(%s)\b" % ("|".join(map(re.escape, needle.split(' ')))), re.I).sub(r'<strong>\1</strong>', text))
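
To see what pattern this approach builds, here is a minimal sketch that leaves out Django's mark_safe wrapper. The helper name build_pattern is made up for illustration and is not from the question; Python 3 is assumed.

```python
import re

def build_pattern(needle):
    # Hypothetical helper: split the search phrase on spaces, escape each
    # word so regex metacharacters are treated literally, and join the
    # words into one case-insensitive alternation between word boundaries.
    words = map(re.escape, needle.split(' '))
    return re.compile(r"\b(%s)\b" % "|".join(words), re.I)

pattern = build_pattern("the show")
print(pattern.pattern)
# \b(the|show)\b
print(pattern.sub(r"<strong>\1</strong>", "Watch the Show tonight"))
# Watch <strong>the</strong> <strong>Show</strong> tonight
```

Note that re.I must be the second argument to re.compile, not part of the % formatting arguments.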

amphetamachine 2010-02-20 01:35:09

Recommending more parentheses is cruel.

aehiilrs 2010-02-20 02:18:35

Answer 2

+1 A:

Here's some duct tape I added to help you pass the cases you listed -- this problem is actually pretty interesting. There are probably some cases it won't get right (e.g. Google will highlight duck if you search for ducks, whereas this only handles duck's).

Without a more general set of guidelines, it is tough to write a regex that covers every case. How close you need it to be will ultimately determine how complex you need to make it.

import re, string

def boldword(text, needle):
    # Strip punctuation (plus any "s" right after it, as in "apple's") from the search phrase.
    n = re.sub('[%s]s*' % re.escape(string.punctuation), '', needle)
    # One alternative per word, plus the whole phrase with its spaces removed.
    patterns = [re.escape(word) for word in n.split(' ')]
    patterns.append(n.replace(' ', ''))
    regex = re.compile(r"\b(%s)\b" % '|'.join(patterns), re.I)
    # If the text, with spaces removed, begins with one of the patterns, bold the entire text.
    match = regex.match(text.replace(' ', ''))
    if match:
        return "<strong>%s</strong>" % text
    return regex.sub(r'<strong>\1</strong>', text)

print(boldword("www.theshow.com", "the show"))
print(boldword("my show (video)", "my show (video)"))
print(boldword("apple ipad", "apple's ipad"))
print(boldword("stack overflow", "stackoverflow"))

Output

>> www.<strong>theshow</strong>.com
>> <strong>my</strong> <strong>show</strong> (<strong>video</strong>)
>> <strong>apple ipad</strong>
>> <strong>stack overflow</strong>

swanson 2010-02-20 01:55:53

Answer 3

+2 A:

Your biggest problem seems to be the word boundaries. If the tokens you're searching for can begin or end with non-word characters (e.g., (video)), enclosing the regex in \b prevents matching. They also prevent matching of two or more contiguous tokens (e.g., theshow in www.theshow.com). However, instead of losing the word boundaries, I suggest you ignore punctuation characters in the search expression and construct the regex so as to match one or more tokens at a time:

re.compile(r"\b((?:%s)+)\b" % "|".join(re.split(r"\W+", needle)), re.I)

Splitting on /\W+/ removes all punctuation as well as whitespace, so there's no need to escape anything. My results seem to match the ones you wanted, except the parentheses in (video) are not highlighted, only the word video is. If the search expression is "the show", it highlights theshow in www.theshow.com, but not in www.footheshowbar.com.
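
As a rough sketch, here is how that regex could be wrapped into a complete function. The name highlight is made up for illustration, and Python 3 is assumed. One thing to watch: re.split(r"\W+", needle) returns an empty string when the needle starts or ends with punctuation, as in "my show (video)", and an empty alternative lets the pattern match nothing at all at a word boundary. This sketch drops empty tokens before building the pattern.

```python
import re

def highlight(text, needle):
    # Hypothetical helper name; wraps the regex from this answer.
    # Split on runs of non-word characters and drop the empty tokens
    # that appear when the needle starts or ends with punctuation.
    tokens = [t for t in re.split(r"\W+", needle) if t]
    if not tokens:
        return text  # nothing to search for
    # Match one or more tokens back to back, bounded by word boundaries.
    regex = re.compile(r"\b((?:%s)+)\b" % "|".join(tokens), re.I)
    return regex.sub(r"<strong>\1</strong>", text)

print(highlight("www.theshow.com", "the show"))
# www.<strong>theshow</strong>.com
print(highlight("my show (video)", "my show (video)"))
# <strong>my</strong> <strong>show</strong> (<strong>video</strong>)
print(highlight("www.footheshowbar.com", "the show"))
# www.footheshowbar.com
```

Because the tokens contain only word characters after the split, they can be joined into the alternation without escaping.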

Alan Moore 2010-02-20 02:17:54
