ansaurus

Question

Highlight a list of words using a regular expression in c#

Answer 1

+6 A:

First you would need to Regex.Escape() all the input strings.

Then you can look for them in the string, and iteratively replace them by the markup you have in mind:

string abbr       = "memb";
string word       = "Member";
// Verbatim string (@"..."), so \b stays a regex word-boundary anchor instead of a backspace character.
string pattern    = String.Format(@"\b{0}\b", Regex.Escape(abbr));
string substitute = String.Format("[a title=\"{0}\"]{1}[/a]", word, abbr);
string output     = Regex.Replace(input, pattern, substitute);

EDIT: I asked if a simple String.Replace() wouldn't be enough - but I can see why regex is desirable: you can use it to enforce "whole word" replacements only by making a pattern that uses word boundary anchors.
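To make the difference concrete, here is a minimal sketch comparing the two approaches on a short sample string. The class and method names (WholeWordDemo, ReplacePlain, ReplaceWholeWord) are made up for illustration and are not part of the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordDemo
{
    // Plain substring replacement: also rewrites "memb" inside longer words.
    public static string ReplacePlain(string input, string abbr, string word)
    {
        return input.Replace(abbr, word);
    }

    // Regex replacement anchored with \b: only whole-word occurrences match.
    public static string ReplaceWholeWord(string input, string abbr, string word)
    {
        string pattern = @"\b" + Regex.Escape(abbr) + @"\b";
        // A MatchEvaluator is used so that a '$' in the replacement text is taken literally.
        return Regex.Replace(input, pattern, m => word);
    }

    public static void Main()
    {
        string input = "a memb, but not amemba";
        Console.WriteLine(ReplacePlain(input, "memb", "Member"));     // a Member, but not aMembera
        Console.WriteLine(ReplaceWholeWord(input, "memb", "Member")); // a Member, but not amemba
    }
}
```

The plain version damages "amemba" because it has no notion of where a word starts or ends, which is the case the word boundary anchors are there to prevent.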

You can go as far as building a single pattern from all your escaped input strings, like this:

\b(?:{abbr_1}|{abbr_2}|{abbr_3}|{abbr_n})\b

and then using a match evaluator to find the right replacement. This way you can avoid iterating the input string more than once.

Tomalak 2009-03-17 10:53:42

I chose this as the selected answer because it works for my extended requirements (in the edit). I built a single pattern and used a match evaluator as suggested, and it works very well, without a foreach loop. Thanks Tomalak!

David Conlisk 2009-03-17 13:49:48

(I posted the final solution below)

David Conlisk 2009-03-17 13:55:04

Answer 2

+1 A:

I doubt it will perform better than a plain string.Replace, so if performance is critical, measure it (after refactoring a bit to use a compiled regex). You can do the regex version as:

var abbrsWithPipes = "(abbr1|abbr2)";
var regex = new Regex(abbrsWithPipes);
return regex.Replace(html, m => GetReplaceForAbbr(m.Value));

You need to implement GetReplaceForAbbr, which receives the specific abbr being matched.
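One possible shape for this, as a rough sketch only: the lookup table, the AbbrReplacer class and the Highlight method are hypothetical, and the [a title=...] markup is borrowed from the first answer. The regex is built once with RegexOptions.Compiled, which is the refactoring mentioned above; whether it actually pays off should still be measured.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AbbrReplacer
{
    // Hypothetical lookup table; real data would come from your own source.
    // The case-insensitive comparer lets "Deb" find the "deb" entry.
    private static readonly Dictionary<string, string> Abbrs =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "memb", "Member" },
            { "deb", "Debut" }
        };

    // Built once and reused across calls; keys are escaped and wrapped in word boundaries.
    private static readonly Regex AbbrRegex = new Regex(
        @"\b(?:" + string.Join("|", Abbrs.Keys.Select(k => Regex.Escape(k))) + @")\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Receives the text that was matched and returns the markup that replaces it.
    public static string GetReplaceForAbbr(string abbr)
    {
        return "[a title=\"" + Abbrs[abbr] + "\"]" + abbr + "[/a]";
    }

    public static string Highlight(string html)
    {
        return AbbrRegex.Replace(html, m => GetReplaceForAbbr(m.Value));
    }

    public static void Main()
    {
        Console.WriteLine(Highlight("Deb is a memb, not amemba"));
    }
}
```

Because the matched text is passed through unchanged, the casing of the original input is kept in the output.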

eglasius 2009-03-17 10:59:52

Answer 3

+2 A:

Hi David,

Not sure how well this will scale to a big word list, but I think it should give the output you want (although in your question the 'result' seems identical to 'content')?

Anyway, let me know if this is what you're after

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var input = @"This is just a little test of the memb to see if it gets picked up. 
Deb of course should also be caught here.";
            var dictionary = new Dictionary<string,string>
            {
                {"memb", "Member"}
                ,{"deb","Debut"}
            };
            var regex = "(" + String.Join(")|(", dictionary.Keys.ToArray()) + ")";
            foreach (Match metamatch in Regex.Matches(input
               , regex  /*@"(memb)|(deb)"*/
               , RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
            { 
                input = input.Replace(metamatch.Value, dictionary[metamatch.Value.ToLower()]);
            }
            Console.Write (input);
            Console.ReadLine();
        }
    }
}

CraigD 2009-03-17 11:01:30

Thanks - very helpful! Do you know how to alter it slightly so that it only matches whole words, i.e. matches memb but not amemba?

David Conlisk 2009-03-17 11:41:07

I see this was answered elsewhere - glad you got what you were after. I thought the loop looked awkward - now that someone else has posted `MatchEvaluator` I do recall using it before. Much nicer!

CraigD 2009-03-17 21:53:04

Answer 4

A:

I'm doing pretty much exactly what you're looking for in my application, and this works for me. The parameter str is your content:

public static string GetGlossaryString(string str)
{
    // This collection would contain your abbreviations; you could make it a Dictionary
    // of abbreviation/full-term pairs and use both in the loop below.
    List<string> glossaryWords = GetGlossaryItems();

    // Quick and dirty way to also match the first and last word in the content.
    str = string.Format(" {0} ", str);

    // Regex.Escape keeps glossary words containing regex metacharacters from breaking the pattern.
    foreach (string word in glossaryWords)
        str = Regex.Replace(str, "(\\W)(" + Regex.Escape(word) + ")(\\W)", "$1<span class='glossaryItem'>$2</span>$3", RegexOptions.IgnoreCase);

    return str.Trim();
}
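One caveat with this pattern style: the non-word character on each side is consumed as part of the match, so when the same glossary word appears twice in a row with a single separator between them, the second occurrence is skipped. The sketch below compares it with a \b-based variant. The BoundaryDemo class and its method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Pattern style from the code above: the \W on each side is part of the match.
    public static string WithConsumedSeparators(string str, string word)
    {
        str = " " + str + " ";
        str = Regex.Replace(str, @"(\W)(" + Regex.Escape(word) + @")(\W)",
            "$1<span class='glossaryItem'>$2</span>$3", RegexOptions.IgnoreCase);
        return str.Trim();
    }

    // \b is zero-width, so neighbouring separators stay available for the next match.
    public static string WithWordBoundaries(string str, string word)
    {
        return Regex.Replace(str, @"\b" + Regex.Escape(word) + @"\b",
            "<span class='glossaryItem'>$0</span>", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(WithConsumedSeparators("memb memb", "memb")); // only the first "memb" is wrapped
        Console.WriteLine(WithWordBoundaries("memb memb", "memb"));     // both are wrapped
    }
}
```

The \b variant also removes the need to pad the string with spaces and trim it afterwards.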

Stefan 2009-03-17 13:33:12

Answer 5

+1 A:

For anyone interested, here is my final solution. It is for a .NET user control. It uses a single pattern with a match evaluator, as suggested by Tomalak, so there is no foreach loop. It's an elegant solution, and it gives me the correct output for the sample input while preserving correct casing for matched strings.

public partial class Abbreviations : System.Web.UI.UserControl
{
    private Dictionary<String, String> dictionary = DataHelper.GetAbbreviations();

    protected void Page_Load(object sender, EventArgs e)
    {
        string input = "This is just a little test of the memb. And another memb, but not amemba to see if it gets picked up. Deb of course should also be caught here.deb!";

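        // Escape each key so abbreviations containing regex metacharacters are matched literally.
        var regex = "\\b(?:" + String.Join("|", dictionary.Keys.Select(k => Regex.Escape(k)).ToArray()) + ")\\b";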

        MatchEvaluator myEvaluator = new MatchEvaluator(GetExplanationMarkup);

        input = Regex.Replace(input, regex, myEvaluator, RegexOptions.IgnoreCase);

        litContent.Text = input;
    }

    private string GetExplanationMarkup(Match m)
    {
        return string.Format("<b title='{0}'>{1}</b>", dictionary[m.Value.ToLower()], m.Value);
    }
}

The output looks like this (below). Note that it only matches full words, and that the casing is preserved from the original string:

This is just a little test of the <b title='Member'>memb</b>. And another <b title='Member'>memb</b>, but not amemba to see if it gets picked up. <b title='Debut'>Deb</b> of course should also be caught here.<b title='Debut'>deb</b>!

David Conlisk 2009-03-17 13:53:45
