Say I have these two strings: "Some Text here" and "some text Here"

And I have a collection that contains the words that I would like to match against the text in the strings. "Some", "Text", "Here"

If one of the words matches a word in the string (regardless of upper or lower case), I would like to take the original word from the string and wrap it in HTML markup like this: <dfn title="Definition of word">Original word</dfn>.

I was playing around with the string.Replace() method, but I am not sure how to get it to match regardless of case while still keeping the original word intact (so that I don't replace "word" with <dfn title="">Word</dfn> or vice versa).

+2  A: 

You might use a regular expression:

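using System;
using System.Text.RegularExpressions;

class Program {

    // Called once per match; m.Value is the text exactly as it appears in the input.
    static string ReplaceWord(Match m) {
        return string.Format("<dfn>{0}</dfn>",m.Value);
    }

    static void Main(string[] args) {

        // One alternative per word; IgnoreCase makes "Some" and "some" both match.
        Regex r = new Regex("some|text|here", RegexOptions.IgnoreCase);
        string input = "Some random text.";
        string replaced = r.Replace(input, ReplaceWord);
        Console.WriteLine(replaced);
    }
}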

RegexOptions.IgnoreCase is used to match words in the list regardless of their case.
The ReplaceWord function returns the matched string (correctly cased) surrounded by the opening and closing tag (note that you still might need to escape the inner string).
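
As a slightly fuller sketch of the same idea, the version below assumes the words and their definitions come in a dictionary. The `DefinitionTagger` class and its `Wrap` method are hypothetical names, not part of the code above. It escapes each word before building the pattern, uses `\b` so that "here" is not matched inside "there" (this assumes the words begin and end with letters or digits), and HTML-encodes everything it inserts. It also assumes no two keys differ only by case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
static class DefinitionTagger
{
    public static string Wrap(string input, IDictionary<string, string> definitions)
    {
        if (definitions.Count == 0) return input;

        // Case-insensitive copy, so the matched text can be used as the lookup key.
        // Throws if two keys differ only by case (assumed not to happen here).
        var lookup = new Dictionary<string, string>(definitions, StringComparer.OrdinalIgnoreCase);

        // Escape every word so regex metacharacters are matched literally,
        // and require a word boundary on both sides of the match.
        string pattern = @"\b(?:" +
            string.Join("|", lookup.Keys.Select(k => Regex.Escape(k))) + @")\b";

        return Regex.Replace(input, pattern, m =>
        {
            string title;
            // Leave the text untouched if the lookup unexpectedly fails.
            if (!lookup.TryGetValue(m.Value, out title)) return m.Value;
            return string.Format("<dfn title=\"{0}\">{1}</dfn>",
                WebUtility.HtmlEncode(title), WebUtility.HtmlEncode(m.Value));
        }, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }
}
```

Calling `DefinitionTagger.Wrap("Some Text here", definitions)` with entries for "some", "text" and "here" wraps each word while keeping the casing found in the input.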

Paolo Tedesco
+4  A: 

Indeed, the string.Replace method is not versatile enough for your requirements in this case. Lower-level text manipulation should do the job. The alternative is of course regex, but the algorithm I present here should be very efficient, and I thought it would be helpful to write it anyway to show how you can do a lot of text manipulation without regex for a change.

Here's the function.

Update:

  1. Now works with a Dictionary<string, string> instead of a string[], which enables a definition to be passed to the function along with the word.
  2. Now works with arbitrary ordering of definitions dictionary.

...

// Requires: using System; using System.Collections.Generic; using System.Text;
public static string HtmlReplace(string value, Dictionary<string, string>
    definitions, Func<string, string, string> htmlWrapper)
{
    var sb = new StringBuilder(value.Length);

    int index = -1;
    int lastEndIndex = 0;
    KeyValuePair<string, string> def;
    while ((index = IndexOf(value, definitions, lastEndIndex,
        StringComparison.InvariantCultureIgnoreCase, out def)) != -1)
    {
        // Copy the text between the previous match and this one unchanged.
        sb.Append(value.Substring(lastEndIndex, index - lastEndIndex));
        // Pass the word as it appears in the text, so its original casing is kept.
        sb.Append(htmlWrapper(value.Substring(index, def.Key.Length), def.Value));
        lastEndIndex = index + def.Key.Length;
    }
    // Copy whatever follows the last match.
    sb.Append(value.Substring(lastEndIndex, value.Length - lastEndIndex));

    return sb.ToString();
}

// Returns the index of the earliest occurrence of any key at or after startIndex,
// or -1 if none of the keys occurs; foundEntry receives the matching entry.
private static int IndexOf(string text, Dictionary<string, string> values, int startIndex,
    StringComparison comparisonType, out KeyValuePair<string, string> foundEntry)
{
    var minEntry = default(KeyValuePair<string, string>);
    int minIndex = -1;
    foreach (var entry in values)
    {
        int index = text.IndexOf(entry.Key, startIndex, comparisonType);
        // Keep this entry if it matched and is earlier than the best match so far.
        if (index != -1 && (minIndex == -1 || index < minIndex))
        {
            minIndex = index;
            minEntry = entry;
        }
    }

    foundEntry = minEntry;
    return minIndex;
}

And a small test program. (Notice the use of a lambda expression for convenience.)

static void Main(string[] args)
{
    var str = "Definition foo; Definition bar; Definition baz";
    var definitions = new Dictionary<string, string>();
    definitions.Add("foo", "Definition 1");
    definitions.Add("bar", "Definition 2");
    definitions.Add("baz", "Definition 3");
    var output = HtmlReplace(str, definitions,
        (word, definition) => string.Format("<dfn title=\"{1}\">{0}</dfn>", 
            word, definition));
    Console.WriteLine(output);
}

Output text:

Definition <dfn title="Definition 1">foo</dfn>; Definition <dfn title="Definition 2">bar</dfn>; Definition <dfn title="Definition 3">baz</dfn>

Hope that helps.

Noldorin
Reason for down-vote, please?
Noldorin
I'm having some problems after changing the words array to a dictionary collection. I get everything working except retrieving the value to send in as the definition text inside the string.format method (the lambda expression). Thanks for the help.
@Frederik: No problem... You could actually just use a switch statement in the lambda expression from the previous version, but I've updated the post to show a version that uses Dictionary instead. Take whichever you prefer.
Noldorin
Great! One last little thing, you're now using word.Key and word.Value but I would like to use the original word instead of word.Key. Thanks again!
Got it working with: `Func<KeyValuePair<string, string>, string, string>`, `sb.Append(htmlWrapper(def, value.Substring(index, def.Key.Length)));` and `var output = HtmlReplace(str, dict, (word, value) => string.Format("<dfn title=\"{1}\">{0}</dfn>", value, word.Value));`
Not sure exactly what you mean - do you just want to use `word` in the lambda expression? You could change the definition of the htmlWrapper function to do this.
Noldorin
(Updated the post again.)
Noldorin
Excellent response, Noldorin! Bookmarked because I can see this coming in handy.
James McConnell
"Indeed, the string.Replace method is not versatile enough for your requirements in this case." Sure it is! Just use a MatchEvaluator like @orsogufo did.
Alan Moore
There is a bug (System.ArgumentOutOfRangeException) when changing the sentence to "Definition baz; Definition bar; Definition Foo"
@Alan: That's Regex.Replace you're confusing it with. :P
Noldorin
@Frederik: Yeah, you're right. The fix wasn't trivial unfortunately, but I've made it now. The function should still be very efficient. :)
Noldorin
A: 

Maybe I have misunderstood your question, but why not just use regular expressions?

If you get your regex right, it is fast and reliable, and each match gives you the index of the matched word in the original string, so you can insert the markup at exactly the desired location.

Note, however, that you will then have to use String.Insert() with the match positions; String.Replace() will not help.
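
A minimal sketch of that approach, assuming the pattern is a plain alternation of the words. The `InsertTagger` class and its parameters are hypothetical names used only for illustration. The matches are processed from right to left, so inserting markup never shifts the index of a match that has not been handled yet.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
static class InsertTagger
{
    public static string Wrap(string input, string pattern, string openTag, string closeTag)
    {
        // All match positions refer to the original, unmodified input.
        MatchCollection matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase);

        string result = input;
        // Go from the last match to the first: text to the left of the
        // current match is still unchanged, so its indexes remain valid.
        for (int i = matches.Count - 1; i >= 0; i--)
        {
            Match m = matches[i];
            result = result.Insert(m.Index + m.Length, closeTag); // close first
            result = result.Insert(m.Index, openTag);             // then open
        }
        return result;
    }
}
```

For example, `InsertTagger.Wrap("Some Text here", "some|text|here", "<dfn>", "</dfn>")` wraps all three words and leaves their casing as it was in the input. Each Insert allocates a new string, so for long texts with many matches a StringBuilder would likely be the better choice.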

Hope that answers your question.

Prashanth
A: 

The simplest way would be to use String.Replace, as you said.

I was surprised there was no option to specify a StringComparison in String.Replace.

I wrote a "not so optimized" but very simple IgnoreCaseReplace for you:

static string IgnoreCaseReplace(string text, string oldValue, string newValue)
{
    int index = 0;
    while ((index = text.IndexOf(oldValue,
        index,
        StringComparison.InvariantCultureIgnoreCase)) >= 0)
    {
        text = text.Substring(0, index)
            + newValue
            + text.Substring(index + oldValue.Length);

        index += newValue.Length;
    }

    return text;
}

To make it nicer, you can wrap it in a static class and make it an extension method of String:

static class MyStringUtilities
{
    public static string IgnoreCaseReplace(this string text, string oldValue, string newValue)
    {
        int index = 0;
        while ((index = text.IndexOf(oldValue,
            index,
            StringComparison.InvariantCultureIgnoreCase)) >= 0)
        {
            text = text.Substring(0, index)
                + newValue
                + text.Substring(index + oldValue.Length);

            index += newValue.Length;
        }

        return text;
    }
}
Maghis
A: 

Regex code:

/// <summary>
/// Converts the input string by formatting the words in the dict with their meanings
/// </summary>
/// <param name="input">Input string</param>
/// <param name="dict">Dictionary contains words as keys and meanings as values</param>
/// <returns>Formatted string</returns>
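public static string FormatForDefns(string input, Dictionary<string,string> dict )
{
    string formatted = input;
    foreach (KeyValuePair<string, string> kv in dict)
    {
        // ${word} re-inserts the text exactly as it appeared in the input, keeping its casing.
        // A literal '$' in the meaning is doubled so it is not read as a substitution.
        string definition = "<dfn title=\"" + kv.Value.Replace("$", "$$") + "\">${word}</dfn>";
        // Escape the key so regex metacharacters in a word are matched literally.
        string pattern = "(?<word>" + Regex.Escape(kv.Key) + ")";
        formatted = Regex.Replace(formatted, pattern, definition, RegexOptions.IgnoreCase);
    }
    return formatted;
}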

This is the calling code

Dictionary<string, string> dict = new Dictionary<string, string>();
dict.Add("word", "meaning");
dict.Add("taciturn", "Habitually silent; not inclined to talk");

string s = "word abase";
string formattedString = MyRegEx.FormatForDefns(s, dict);
Rashmi Pandit
Doing a regex replace like this multiple times (For every dictionary entry) is going to be horribly inefficient.
Noldorin
Yes, you are right.
Rashmi Pandit
You also run the risk of having your regex mistakenly match text that was added to the string by an earlier Replace(). For example, if one of the keywords was "title" you would end up replacing the "title" attribute name in any already-existing dfn elements.
Alan Moore
A: 

First, I'm going to be mean and provide an anti-answer: a test case for you that is a bugger to code against.

What happens if I have the terms:

Web Browser
Browser History

And I run it against the phrase:

Now, clean the web browser history by ...

Do you get

Now, clean the <dfn title="Definition of word">web <dfn title="Definition of word">browser</dfn> history</dfn> by ...

I've recently been wrestling with the same problem, but I don't think my solution would help you - http://github.com/jarofgreen/TaggedWiki/blob/d002997444c35cafecd85316280a896484a06511/taggedwikitest/taggedwiki/views.py line 47 onwards. I ended up putting a marker in front of the tag and not wrapping the text.

However, I may have one part of the answer for you: in order to avoid catching words in the HTML (the problem of what happens if you have a tag of "title" you identified in your last paragraph) I did 2 passes. In the first, searching pass I stored the locations of the phrases to wrap; then in my second, non-searching pass I put in the actual HTML. This way, there is no HTML in the text while you are doing your actual searching.
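
The two passes could look roughly like this in C#. This is a sketch under my own assumptions, not a port of the Python code linked above: `TwoPassTagger` is a hypothetical name, overlapping phrases are resolved by keeping the leftmost one (the longer one on a tie), and HTML-encoding of the titles is left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration only.
static class TwoPassTagger
{
    public static string Wrap(string text, IDictionary<string, string> definitions)
    {
        // Pass 1: record (start, length, title) for every occurrence.
        // The text contains no inserted HTML while we search.
        var spans = new List<Tuple<int, int, string>>();
        foreach (var entry in definitions)
        {
            if (entry.Key.Length == 0) continue; // an empty phrase would match everywhere
            int index = 0;
            while ((index = text.IndexOf(entry.Key, index, StringComparison.OrdinalIgnoreCase)) >= 0)
            {
                spans.Add(Tuple.Create(index, entry.Key.Length, entry.Value));
                index += entry.Key.Length;
            }
        }

        // Pass 2: emit the spans from left to right without searching again,
        // skipping any span that overlaps one that was already emitted.
        var sb = new StringBuilder(text.Length);
        int pos = 0;
        foreach (var span in spans.OrderBy(s => s.Item1).ThenByDescending(s => s.Item2))
        {
            if (span.Item1 < pos) continue;
            sb.Append(text, pos, span.Item1 - pos);
            sb.Append("<dfn title=\"").Append(span.Item3).Append("\">");
            sb.Append(text, span.Item1, span.Item2); // original casing from the text
            sb.Append("</dfn>");
            pos = span.Item1 + span.Item2;
        }
        sb.Append(text, pos, text.Length - pos);
        return sb.ToString();
    }
}
```

With the terms "Web Browser" and "Browser History" and the phrase from the test case above, this wraps only "web browser" and leaves " history" as plain text, so no nested dfn elements are produced. Whether that is the right way to resolve the overlap depends on what you want to show the reader.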

James