So what I'm trying to do is replace a string "keyword" with "<b>keyword</b>" in a larger string.

Example:

myString = "HI there. You should higher that person for the job. Hi hi."

keyword = "hi"

result I would want would be:

result = "<b>HI</b> there. You should higher that person for the job. <b>Hi</b> <b>hi</b>."

I will not know the keyword until the user types it, and I won't know the corpus (myString) until the query is run.

I found a solution that works most of the time but has some false positives: it would return "<b>hi</b>gher", which is not what I want. Also note that I am trying to preserve the case of the original text, and the matching should be case-insensitive. So if the keyword is "hi", it should replace HI with <b>HI</b> and hi with <b>hi</b>.

The closest I have come is using a slightly derived version of this: http://code.activestate.com/recipes/576715/ but I still could not figure out how to do a second pass of the string to fix all of the false positives mentioned above.

The other option is NLTK's WordPunctTokenizer (which simplifies some things, like punctuation), but I'm not sure how I would put the sentences back together, given that it has no reverse function and I want to keep the original punctuation of myString. Essentially, concatenating all the tokens does not return the original string. For example, if the original text had "7 - 7", I would not want it to become "7-7" when the tokens are regrouped.

Hope that was clear enough. It seems like a simple problem, but it turned out to be a little more difficult than I thought.

+2  A: 

This ok?

>>> import re
>>> myString = "HI there. You should higher that person for the job. Hi hi."
>>> keyword = "hi"
>>> search = re.compile(r'\b(%s)\b' % re.escape(keyword), re.I)
>>> search.sub('<b>\\1</b>', myString)
'<b>HI</b> there. You should higher that person for the job. <b>Hi</b> <b>hi</b>.'

The key to the whole thing is using word boundaries, groups and the re.I flag.
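
To see what the word boundaries contribute, the same substitution can be run with and without `\b`. This is a minimal sketch that reuses the question's example; the names `loose` and `strict` are only illustrative, and `re.escape` is added on the assumption that the keyword should be matched literally.

```python
import re

text = "HI there. You should higher that person for the job. Hi hi."
keyword = "hi"

# No word boundaries: the keyword is also matched inside "higher".
loose = re.compile(r'(%s)' % re.escape(keyword), re.I)
loose_result = loose.sub(r'<b>\1</b>', text)

# \b on both sides restricts matches to whole words; group 1 carries the
# matched text, in its original case, into the replacement.
strict = re.compile(r'\b(%s)\b' % re.escape(keyword), re.I)
strict_result = strict.sub(r'<b>\1</b>', text)

print(loose_result)
print(strict_result)
```

The first result contains the false positive "<b>hi</b>gher"; the second leaves "higher" untouched.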

Paolo Bergantino
This is pretty much what I wanted. I might have to change what constitutes a word boundary, as Dave B points out, but that should be easy to do, and I would have to look through the data to figure that out later (if I need to). Otherwise this is exactly what I needed, and I'm sure it covers all the cases I could come up with. Thanks.
Johnny4000
A: 

You should be able to do this very easily with re.sub using the word boundary assertion \b, which only matches at a word boundary:

import re

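def SurroundWith(text, keyword, before, after):
  # re.escape makes the keyword match literally, even if it contains regex metacharacters
  regex = re.compile(r'\b%s\b' % re.escape(keyword), re.IGNORECASE)
  # \g<0> is the whole match; a bare \0 is read as a NUL character, not a group reference
  return regex.sub(r'%s\g<0>%s' % (before, after), text)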

Then you get:

>>> SurroundWith('HI there. You should hire that person for the job. '
...              'Hi hi.', 'hi', '<b>', '</b>')
'<b>HI</b> there. You should hire that person for the job. <b>Hi</b> <b>hi</b>.'

If you have more complicated criteria for what constitutes a "word boundary," you'll have to do something like:

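def SurroundWith2(text, keyword, before, after):
  # Lookarounds test the neighbouring characters without consuming them, so
  # matches at the start or end of the text and adjacent matches are found too.
  regex = re.compile(r'(?<![a-zA-Z0-9])(%s)(?![a-zA-Z0-9])' % keyword,
                     re.IGNORECASE)
  return regex.sub(r'%s\1%s' % (before, after), text)

You can modify the [a-zA-Z0-9] classes to cover anything you consider a "word" character.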
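
One way to make the notion of a "word" character adjustable is to pass the character class in as a parameter. The following is only a sketch: `surround_custom` and `word_chars` are hypothetical names, not part of the answer above, and it assumes `word_chars` is written in the syntax that goes inside a regex character class.

```python
import re

def surround_custom(text, keyword, before, after, word_chars="a-zA-Z0-9"):
    # Negative lookbehind/lookahead: the keyword must not be directly
    # preceded or followed by any character listed in word_chars.
    pattern = re.compile(
        r'(?<![%s])(%s)(?![%s])' % (word_chars, re.escape(keyword), word_chars),
        re.IGNORECASE)
    # A function replacement inserts before/after verbatim, so backslashes
    # in them are not interpreted as template escapes.
    return pattern.sub(lambda m: before + m.group(1) + after, text)

sample = "Hi hi-fi. hi"
default_result = surround_custom(sample, "hi", "<b>", "</b>")
# Treat "-" as a word character too, so the "hi" in "hi-fi" is skipped.
hyphen_result = surround_custom(sample, "hi", "<b>", "</b>",
                                word_chars="a-zA-Z0-9-")
print(default_result)
print(hyphen_result)
```

With the default class the "hi" in "hi-fi" is wrapped; with the hyphen added to `word_chars` it is left alone.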

I ran

before = '<b>'
after = '</b>'
text = "HI there. You should higher that person for the job. Hi hi."
keyword = 'hi'
print 'result = ', SurroundWith(text, keyword, before, after)

and got result = <b>
Johnny4000
A: 

I think the best solution would be a regular expression...

import re
def reg(keyword, myString) :
   regx = re.compile(r'\b(' + keyword + r')\b', re.IGNORECASE)
   return regx.sub(r'<b>\1</b>', myString)

Of course, you must first make your keyword "regular expression safe" (quote any regex special characters), for example with re.escape.
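
As a rough sketch of that quoting step, the standard library's `re.escape` can be applied to the keyword before it is spliced into the pattern. The function name `highlight` is made up for this example. Note that `\b` still assumes the keyword begins and ends with a word character, so a keyword such as "c++" would need a different boundary test.

```python
import re

def highlight(keyword, my_string):
    # re.escape backslash-escapes regex metacharacters, so "." in the
    # keyword matches only a literal dot rather than any character.
    regx = re.compile(r'\b(' + re.escape(keyword) + r')\b', re.IGNORECASE)
    return regx.sub(r'<b>\1</b>', my_string)

escaped_result = highlight("a.b", "a.b axb A.B")
print(escaped_result)
```

Without the escaping, the pattern `a.b` would also match "axb".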

Francis
+1  A: 

Here's one suggestion, from the nitpicking committee. :-)

myString = "HI there. You should higher that person for the job. Hi hi."

myString = myString.replace('higher', 'hire')