I'm trying to replace every occurrence of a string "keyword" with
"<b>keyword</b>"
in a larger string.
Example:
myString = "HI there. You should higher that person for the job. Hi hi."
keyword = "hi"
The result I want would be:
result = "<b>HI</b> there. You should higher that person for the job. <b>Hi</b> <b>hi</b>."
I will not know the keyword until the user types it, and I won't know the corpus (myString) until the query is run.
I found a solution that works most of the time but produces some false positives:
namely, it returns "<b>hi</b>gher",
which is not what I want because "hi" is only part of a longer word there. Also note that I
am trying to preserve the case of the original text, and the matching should be
case-insensitive. So if the keyword is "hi", it should replace
HI with <b>HI</b> and hi with <b>hi</b>.
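To make the desired behavior concrete, here is a minimal sketch of it, assuming a regex-based approach with word boundaries is acceptable. The function name `bold_keyword` is hypothetical, and the sketch assumes the keyword starts and ends with a word character (otherwise `\b` will not behave as intended).

```python
import re

def bold_keyword(text, keyword):
    """Wrap whole-word, case-insensitive matches of keyword in <b> tags."""
    # re.escape guards against regex metacharacters in the user's keyword.
    # \b limits matches to word boundaries, so "hi" does not match in "higher".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    # \g<0> re-inserts the matched text itself, so the original case is kept.
    return pattern.sub(r"<b>\g<0></b>", text)

my_string = "HI there. You should higher that person for the job. Hi hi."
print(bold_keyword(my_string, "hi"))
# <b>HI</b> there. You should higher that person for the job. <b>Hi</b> <b>hi</b>.
```

Because the substitution works on the original string in place, punctuation and spacing are left untouched.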
The closest I have come is a slightly modified version of this recipe: http://code.activestate.com/recipes/576715/ but I could not figure out how to do a second pass over the string to fix the false positives mentioned above.
I also tried NLTK's WordPunctTokenizer (which simplifies some things, like punctuation), but I'm not sure how to put the sentences back together, given that it has no reverse function and I want to keep the original punctuation and spacing of myString. Essentially, concatenating all the tokens does not return the original string. For example, if the original text had "7 - 7", I would not want it to become "7-7" when regrouping the tokens.
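The round-trip problem can be illustrated, and possibly sidestepped, with a tokenizer that keeps the separators. This is only a sketch under assumptions: it uses `re.split` with a capturing group rather than NLTK, the names `split_keep_separators` and `bold_tokens` are hypothetical, and it only handles single-word keywords.

```python
import re

def split_keep_separators(text):
    # The capturing group makes re.split return the separators (runs of
    # non-word characters) as tokens too, so nothing from the text is lost.
    return re.split(r"(\W+)", text)

def bold_tokens(text, keyword):
    target = keyword.lower()
    tokens = split_keep_separators(text)
    # Compare whole tokens case-insensitively; emit the token unchanged
    # inside the tags so the original case is preserved.
    marked = [f"<b>{t}</b>" if t.lower() == target else t for t in tokens]
    # Joining with the empty string restores the original spacing exactly.
    return "".join(marked)

print(split_keep_separators("7 - 7"))           # ['7', ' - ', '7']
print("".join(split_keep_separators("7 - 7")))  # 7 - 7
print(bold_tokens("Hi hi, higher.", "hi"))      # <b>Hi</b> <b>hi</b>, higher.
```

Since every character of the input ends up in exactly one token, joining the tokens always reproduces the original string when nothing matches.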
I hope that was clear enough. It seems like a simple problem, but it has turned out to be a little more difficult than I thought.