def boldword(text, needle):
    return mark_safe(re.compile(r"\b(%s)\b" % "|".join(map(re.escape, needle.split(' '))), re.I).sub(r'<strong>\1</strong>', text))

This is my current function for bolding the words of a needle inside a string of text (like Google, which bolds your search terms in the results).

  • When the needle is "the show", it will not highlight "www.theshow.com".
  • When the needle is "my show (video)", it will not highlight "my show (video)"...it only highlights my show.
  • When the needle is "apple's ipad", it will not highlight "apple ipad"...it only highlights ipad.

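    Expected output: www.<strong>theshow</strong>.com, current output: www.theshow.com

    Expected output: <strong>my show (video)</strong>, current output: <strong>my</strong> <strong>show</strong> (video)

    Expected output: <strong>apple ipad</strong>, current output: apple <strong>ipad</strong>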
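
The three cases can be reproduced without Django by leaving out mark_safe. A minimal sketch, where boldword_plain is a name made up for illustration and the regex is otherwise the one from the function above:

```python
import re

def boldword_plain(text, needle):
    # Same regex as boldword above, minus Django's mark_safe
    # (boldword_plain is a hypothetical name used only for this sketch).
    words = map(re.escape, needle.split(' '))
    pattern = re.compile(r"\b(%s)\b" % "|".join(words), re.I)
    return pattern.sub(r'<strong>\1</strong>', text)

# The (text, needle) pairs from the list above.
for text, needle in [
    ("www.theshow.com", "the show"),
    ("my show (video)", "my show (video)"),
    ("apple ipad", "apple's ipad"),
]:
    print(boldword_plain(text, needle))
```

Printing pattern.pattern inside the function is a quick way to see exactly which alternatives the regex is looking for.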

I think the main problem is that I split only on spaces and ignore other punctuation, right? Can someone modify my current function to take these cases into account?

Thanks

+1  A: 

From your description, it sounds like (A) the input variable isn't being split correctly on spaces and (B) it isn't being escaped properly.

I think it may be a case of an under-parenthesized expression.

Try this:

return mark_safe(re.compile(r"\b(%s)\b" % ("|".join(map(re.escape, needle.split(' ')))), re.I).sub(r'<strong>\1</strong>', text))
amphetamachine
Recommending more parentheses is cruel.
aehiilrs
+1  A: 

Here's some duct tape I added to help you pass the cases you listed -- this problem is actually pretty interesting. There are probably some cases it still gets wrong (e.g. Google will highlight duck if you search for ducks, whereas this only handles duck's).

Without a more general set of guidelines, it is tough to write a regex that covers every case; how close you need to get will ultimately determine how complex the regex has to be.

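import re
import string

def boldword(text, needle):
    # Strip punctuation together with any run of "s" right after it ("apple's" -> "apple").
    n = re.sub('[%s]s*' % re.escape(string.punctuation), '', needle)
    # One alternative per word; list() so that append() also works on Python 3.
    patterns = list(map(re.escape, n.split(' ')))
    # Also try the whole needle with its spaces removed ("the show" -> "theshow").
    patterns.append(re.escape(n.replace(' ', '')))
    regex = re.compile(r"\b(%s)\b" % '|'.join(patterns), re.I)
    # If the text with its spaces removed starts with a match, bold the whole text.
    match = regex.match(text.replace(' ', ''))
    if match:
        return "<strong>%s</strong>" % text
    return regex.sub(r'<strong>\1</strong>', text)

print(boldword("www.theshow.com", "the show"))
print(boldword("my show (video)", "my show (video)"))
print(boldword("apple ipad", "apple's ipad"))
print(boldword("stack overflow", "stackoverflow"))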

Output

>> www.<strong>theshow</strong>.com
>> <strong>my</strong> <strong>show</strong> (<strong>video</strong>)
>> <strong>apple ipad</strong>
>> <strong>stack overflow</strong>
swanson
+2  A: 

Your biggest problem seems to be the word boundaries. If the tokens you're searching for can begin or end with non-word characters (e.g., (video)), enclosing the regex in \b prevents matching. They also prevent matching of two or more contiguous tokens (e.g., theshow in www.theshow.com). However, instead of losing the word boundaries, I suggest you ignore punctuation characters in the search expression and construct the regex so as to match one or more tokens at a time:

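re.compile(r"\b((?:%s)+)\b" % "|".join(filter(None, re.split(r"\W+", needle))), re.I)

Splitting on /\W+/ removes all punctuation as well as whitespace, so there's no need to escape anything. The filter(None, ...) call drops the empty strings that re.split returns when the needle begins or ends with punctuation, as in "my show (video)"; without it the pattern would contain an empty alternative. My results seem to match the ones you wanted, except that the parentheses in (video) are not highlighted, only the word video is. If the search expression is "the show", it highlights theshow in www.theshow.com, but not in www.footheshowbar.com.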
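
Put together as a complete function, the approach might look like the following. This is a sketch under a few assumptions, not a drop-in replacement: the name boldword_tokens is made up here, mark_safe from the question is left out so that it runs without Django, and empty strings from re.split (which appear when the needle begins or ends with punctuation) are dropped before the pattern is built.

```python
import re

def boldword_tokens(text, needle):
    # Split on runs of non-word characters; drop empty strings left at the ends.
    tokens = [t for t in re.split(r"\W+", needle) if t]
    if not tokens:
        return text  # nothing to search for
    # One or more tokens back to back, with word boundaries around the whole run.
    # Tokens contain only word characters, so they need no escaping.
    regex = re.compile(r"\b((?:%s)+)\b" % "|".join(tokens), re.I)
    return regex.sub(r"<strong>\1</strong>", text)

print(boldword_tokens("www.theshow.com", "the show"))
print(boldword_tokens("www.footheshowbar.com", "the show"))
print(boldword_tokens("my show (video)", "my show (video)"))
print(boldword_tokens("apple ipad", "apple's ipad"))
```

One side effect worth knowing about: apple's is split into the tokens apple and s, so a stand-alone s in the text would be highlighted as well.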

Alan Moore