I am trying to scan a block of text for a specific name and turn each occurrence into a link. For example:

Block of text:

Chairman Joe Smith has announced a new plan to decrease expenditures by outsourcing the planning of the new dining hall. Smith states the current project managers do not have excess time to commit to this new project and this will be less costly than hiring a new or contract project manager.

What I want to do is take every instance of Chairman Joe Smith, Joe Smith, Smith, or Chairman Smith and wrap it in a link to his profile/bio. With any of the string methods I know of (string replace, StringBuilder, adding text before and after a matching string), I run into a problem when scanning for Smith and then for any of the other names.

If I try the following:

String.replace("Smith", "<a href='smithbio.html'>Smith</a>")
String.replace("Chairman Joe Smith", "<a href='smithbio.html'>Chairman Joe Smith</a>")

This fails because once Smith has been wrapped in a link, the longer names no longer appear verbatim in the text, so the second replace never matches and only Smith becomes the link.

But if I try the opposite:

String.replace("Chairman Joe Smith", "<a href='smithbio.html'>Chairman Joe Smith</a>")
String.replace("Smith", "<a href='smithbio.html'>Smith</a>")

This creates nested links.

I am thinking maybe I should be using Regex.Replace in combination with substring checks? If so, I am having trouble coming up with how to do it. How can I do these multiple replaces so that a string is replaced unless it is part of another string that is also being replaced? FYI, I'm doing this in VB; I don't think it matters here, but just in case...

+3  A: 

You should use a regex, like this: (VB, tested)

Regex.Replace(str, "(Chairman\s+)?(Joe\s+)?Smith", _
    "<a href='smithbio.html'>$0</a>")

$0 inserts the entire matched text; it is one of several substitution expressions that can be included in the replacement string.

If you only know the names at runtime, make sure to pass each one through Regex.Escape before building the pattern.
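For the runtime case, a rough sketch might look like the following (written in C#; the same Regex calls are available from VB). The `NameLinker` class, its `BuildPattern` and `Link` methods, and the sample names are hypothetical and not part of the answer above. Each name goes through Regex.Escape, and longer names are listed first so that, when one name is a prefix of another, the longer one wins.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class NameLinker
{
    // Hypothetical helper: builds one alternation from names known only at runtime.
    public static string BuildPattern(string[] names)
    {
        var escaped = names
            .OrderByDescending(n => n.Length)   // longest alternative first
            .Select(n => Regex.Escape(n));      // neutralize regex metacharacters
        return "(" + string.Join("|", escaped) + ")";
    }

    // Wraps every match in a link; $0 stands for the whole matched text.
    // Assumes url contains no '$', which has special meaning in replacement strings.
    public static string Link(string text, string[] names, string url)
    {
        return Regex.Replace(text, BuildPattern(names), "<a href='" + url + "'>$0</a>");
    }
}
```

Calling `NameLinker.Link("Chairman Joe Smith spoke. Smith left.", names, "smithbio.html")` with the four names from the question should link "Chairman Joe Smith" once and the later "Smith" once, with no nesting.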

SLaks
Thanks! This example and that link were very helpful. Let's say Chairman Joe Smith has multiple titles and I want to parse for any mention of them, would the expression then be: `"((Honored\s+)|(Title2\s+)|(Chairman\s+)?(Joe\s+)?Smith"` ?
sah302
@sah302: Wrap all of the titles in an outer set of parentheses; otherwise the `|` applies to everything on either side of it, not just the titles. `"(Honored\s+|Title2\s+|Chairman\s+)?(Joe\s+)?Smith"` (Tested)
SLaks
+1  A: 

One of the things you can do with .NET regex objects is to replace a match with the result of a delegate (a MatchEvaluator) passed to Regex.Replace.

In the delegate you can use the result of the match (and any surrounding string you wish) in determining the replacement text (returned from the delegate).
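As a minimal sketch of that idea (in C#; the `BioLinker` class, the `BioPages` lookup, and the second surname "Jones" are made up for illustration), the delegate below picks the link target from the surname captured in each match:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BioLinker
{
    // Hypothetical lookup from surname to bio page.
    static readonly Dictionary<string, string> BioPages = new Dictionary<string, string>
    {
        { "Smith", "smithbio.html" },
        { "Jones", "jonesbio.html" }
    };

    public static string Link(string text)
    {
        // Optional title and first name, followed by a captured surname.
        string pattern = @"(Chairman\s+)?(Joe\s+)?(?<surname>Smith|Jones)";

        // The delegate runs once per match and returns the replacement text.
        MatchEvaluator evaluator = m =>
        {
            string page = BioPages[m.Groups["surname"].Value];
            return "<a href='" + page + "'>" + m.Value + "</a>";
        };

        return Regex.Replace(text, pattern, evaluator);
    }
}
```

Because the replacement is computed in code rather than in a replacement string, each person can get a different link from a single pass over the text.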

Richard
A: 

I'm not suggesting that you do this; however, it is imperative that programmers be able to deduce and reason through algorithms from the problems they are presented with, especially when maintaining legacy code bases. We've gotten spoiled by all of the high-level abstractions. We simply ask, "How can I do X, Y, Z?" and boom, we throw a regex or a LINQ query at it. I'm not saying those are bad things, but every once in a while it pays to think a little deeper. Perhaps this is meant more for Code Golf or something, but had the OP presented a reasoned-through algorithm, I would feel a lot better about presenting a canned approach, because then the OP would probably have recognized on their own that regex might work as a solution.

Without using a regex, you could search for each of these strings and record the index and length of every occurrence found:

  1. Chairman Joe Smith
  2. Joe Smith
  3. Smith
  4. Chairman Smith

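You would then drop any occurrence that lies entirely inside a longer one (for example, the Smith inside Chairman Joe Smith) and go through the remaining list, replacing each item with its respective link.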

I'm sure this algorithm can be improved.

Here's an example:

using System;
using System.Collections.Generic;

// One occurrence of a lookup string within the text.
class Instance
{
    public int start;
    public int length;
    public string text;
    public Instance(int _start, int _length, string _text)
    {
        start = _start;
        length = _length;
        text = _text;
    }
}

class Program
{
    static void Main(string[] args)
    {
        string test = "Chairman Joe Smith has announced a new plan to decrease expenditures by outsourcing the planning of the new dining hall. Smith states the current project managers do not have excess time to commit to this new project and this will be less costly than hiring a new or contract project manager.";
        string[] lookup = { "Chairman Joe Smith", "Joe Smith", "Smith", "Chairman Smith" };
        List<Instance> li = new List<Instance>();

        // record each instance of specified strings
        foreach (string name in lookup)
        {
            int index = 0;
            do
            {
                index = test.IndexOf(name, index);
                if (index > -1)
                {
                    li.Add(new Instance(index, name.Length, name));
                    index += name.Length;
                }
            } while (index > -1);
        }

        // eliminate instances that lie entirely inside another instance
        // (restart the scan after each removal, since the list has changed)
        Retry:
        foreach (Instance i in li)
        {
            foreach (Instance j in li)
            {
                if (j != i)
                {
                    if ((j.start >= i.start) && (j.start + j.length <= i.start + i.length))
                    {
                        li.Remove(j);
                        goto Retry;
                    }
                }
            }
        }

        // replace each instance with respective text
        foreach (Instance i in li)
        {
            test = test.Remove(i.start, i.length);
            string final = "<a href='smithbio.html'>" + i.text + "</a>";
            test = test.Insert(i.start, final);

            // shift only the instances that come later in the text;
            // earlier ones are unaffected by this insertion
            foreach (Instance j in li)
            {
                if (j.start > i.start)
                {
                    j.start += (final.Length - i.length);
                }
            }
        }

        Console.WriteLine(test);
        Console.ReadLine();
    }
}
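
One possible improvement, offered only as a sketch: sort the recorded occurrences by position and build the result left to right, which avoids both the restart loop and the offset bookkeeping. The `ManualLinker` class and its `Link` method are hypothetical names, and unlike the code above this version also skips partially overlapping occurrences, which is an assumption about the desired behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class ManualLinker
{
    public static string Link(string text, string[] names, string url)
    {
        // Record the start and length of every occurrence of every name.
        var hits = new List<(int Start, int Length)>();
        foreach (string name in names)
        {
            if (name.Length == 0) continue;
            int index = text.IndexOf(name, StringComparison.Ordinal);
            while (index >= 0)
            {
                hits.Add((index, name.Length));
                index = text.IndexOf(name, index + name.Length, StringComparison.Ordinal);
            }
        }

        // Earliest start first; at the same start, the longest occurrence first.
        hits.Sort((a, b) => a.Start != b.Start
            ? a.Start.CompareTo(b.Start)
            : b.Length.CompareTo(a.Length));

        var sb = new StringBuilder();
        int pos = 0; // end of the text copied so far
        foreach (var hit in hits)
        {
            if (hit.Start < pos) continue; // inside or overlapping a span already linked
            sb.Append(text, pos, hit.Start - pos);
            sb.Append("<a href='").Append(url).Append("'>");
            sb.Append(text, hit.Start, hit.Length);
            sb.Append("</a>");
            pos = hit.Start + hit.Length;
        }
        sb.Append(text, pos, text.Length - pos);
        return sb.ToString();
    }
}
```

Since the original text is never modified, the recorded indices stay valid throughout and no instance needs to be shifted.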