I have some site content that contains abbreviations. I have a list of recognised abbreviations for the site, along with their explanations. I want to create a regular expression which will allow me to replace all of the recognised abbreviations found in the content with some markup.

For example:

content:

This is just a little test of the memb to see if it gets picked up. 
Deb of course should also be caught here.

abbreviations:

memb = Member; deb = Debut; 

result:

This is just a little test of the [a title="Member"]memb[/a] to see if it gets picked up. 
[a title="Debut"]Deb[/a] of course should also be caught here.

(This is just example markup for simplicity).

Thanks.

EDIT:

CraigD's answer is nearly there, but there are issues. I only want to match whole words. I also want to keep the correct capitalisation of each word replaced, so that deb is still deb, and Deb is still Deb as per the original text. For example, this input:

This is just a little test of the memb. 
And another memb, but not amemba. 
Deb of course should also be caught here.deb!
+6  A: 

First you would need to Regex.Escape() all the input strings.

Then you can look for them in the string, and iteratively replace them by the markup you have in mind:

string abbr       = "memb";
string word       = "Member";
// Verbatim string: in a regular C# literal, "\b" is a backspace character, not a word boundary.
string pattern    = String.Format(@"\b{0}\b", Regex.Escape(abbr));
string substitute = String.Format("[a title=\"{0}\"]{1}[/a]", word, abbr);
string output     = Regex.Replace(input, pattern, substitute);

EDIT: I asked if a simple String.Replace() wouldn't be enough - but I can see why regex is desirable: you can use it to enforce "whole word" replacements only by making a pattern that uses word boundary anchors.
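To illustrate the difference, here is a minimal sketch comparing a plain substring replacement with a word-boundary pattern. The class and method names (`WholeWordDemo`, `NaiveReplace`, `WholeWordReplace`) and the bracket markup are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordDemo
{
    // Plain substring replacement: also rewrites "memb" inside "amemba".
    public static string NaiveReplace(string input)
    {
        return input.Replace("memb", "[memb]");
    }

    // Word-boundary replacement: only a standalone "memb" is rewritten.
    // "$0" in the replacement string stands for the whole matched text.
    public static string WholeWordReplace(string input)
    {
        return Regex.Replace(input, @"\bmemb\b", "[$0]");
    }

    public static void Main()
    {
        const string input = "memb, but not amemba";
        Console.WriteLine(NaiveReplace(input));      // [memb], but not a[memb]a
        Console.WriteLine(WholeWordReplace(input));  // [memb], but not amemba
    }
}
```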

You can go as far as building a single pattern from all your escaped input strings, like this:

\b(?:{abbr_1}|{abbr_2}|{abbr_3}|{abbr_n})\b

and then using a match evaluator to find the right replacement. This way you can avoid iterating the input string more than once.
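A rough sketch of that idea, assuming the abbreviations live in a dictionary created with a case-insensitive comparer (so that "Deb" finds the "deb" entry). The names `AbbreviationMarkup` and `Apply` are hypothetical, and the dictionary is assumed to be non-empty.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class AbbreviationMarkup
{
    // Assumes 'abbreviations' is non-empty and uses a case-insensitive key comparer.
    public static string Apply(string input, IDictionary<string, string> abbreviations)
    {
        // Escape every key so regex metacharacters in an abbreviation are matched literally.
        string alternatives = string.Join(
            "|", abbreviations.Keys.Select(k => Regex.Escape(k)).ToArray());
        string pattern = @"\b(?:" + alternatives + @")\b";

        // The evaluator runs once per match; m.Value keeps the casing found in the input.
        return Regex.Replace(
            input,
            pattern,
            m => string.Format("[a title=\"{0}\"]{1}[/a]", abbreviations[m.Value], m.Value),
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        var abbreviations = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "memb", "Member" },
            { "deb", "Debut" }
        };
        Console.WriteLine(Apply("Deb and memb, but not amemba.", abbreviations));
    }
}
```

Note that `\b` only behaves as a whole-word anchor when the abbreviation starts and ends with a word character; an abbreviation ending in a full stop would need a different boundary check.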

Tomalak
I choose this as the selected answer because it works for my extended requirements (in the edit). I built a single pattern and used a match evaluator as suggested, and it works very well and without a foreach loop too. Thanks Tomalak!
David Conlisk
(I posted the final solution below)
David Conlisk
+1  A: 

I doubt it will perform better than a plain string.Replace, so if performance is critical, measure it (and consider refactoring a bit to use a compiled regex). You can do the regex version as:

var abbrsWithPipes = "(abbr1|abbr2)";
var regex = new Regex(abbrsWithPipes);
return regex.Replace(html, m => GetReplaceForAbbr(m.Value));

You need to implement GetReplaceForAbbr, which receives the specific abbr being matched.
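One possible shape for GetReplaceForAbbr, as a sketch only: the lookup table, the `AbbrReplacer` class name and the bracket markup are assumptions for illustration. The pattern here has no word boundaries, matching the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AbbrReplacer
{
    // Hypothetical lookup table; the case-insensitive comparer lets "Deb" find the "deb" entry.
    private static readonly Dictionary<string, string> Explanations =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "memb", "Member" },
            { "deb", "Debut" }
        };

    public static string GetReplaceForAbbr(string abbr)
    {
        string explanation;
        // Fall back to the unchanged text if the match is not in the table.
        return Explanations.TryGetValue(abbr, out explanation)
            ? string.Format("[a title=\"{0}\"]{1}[/a]", explanation, abbr)
            : abbr;
    }

    public static void Main()
    {
        var regex = new Regex("(memb|deb)", RegexOptions.IgnoreCase);
        Console.WriteLine(regex.Replace("Deb of the memb", m => GetReplaceForAbbr(m.Value)));
    }
}
```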

eglasius
+2  A: 

Hi David,

Not sure how well this will scale to a big word list, but I think it should give the output you want (although in your question the 'result' seems identical to 'content')?

Anyway, let me know if this is what you're after

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var input = @"This is just a little test of the memb to see if it gets picked up. 
Deb of course should also be caught here.";
            var dictionary = new Dictionary<string,string>
            {
                {"memb", "Member"}
                ,{"deb","Debut"}
            };
            var regex = "(" + String.Join(")|(", dictionary.Keys.ToArray()) + ")";
            foreach (Match metamatch in Regex.Matches(input
               , regex  /*@"(memb)|(deb)"*/
               , RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
            { 
                input = input.Replace(metamatch.Value, dictionary[metamatch.Value.ToLower()]);
            }
            Console.Write (input);
            Console.ReadLine();
        }
    }
}
CraigD
Thanks - very helpful! Do you know how to alter it slightly so that it only matches whole words, i.e. matches memb but not amemba ?
David Conlisk
I see this was answered elsewhere - glad you got what you were after. I thought the loop looked awkward - now that someone else has posted `MatchEvaluator` I do recall using it before. Much nicer!
CraigD
A: 

I'm doing pretty much exactly what you're looking for in my application, and this works for me. The parameter str is your content:

public static string GetGlossaryString(string str)
{
    // This collection would contain your abbreviations; you could make it a Dictionary
    // so you have the abbreviation/full-term pairs available in the loop below.
    List<string> glossaryWords = GetGlossaryItems();

    // Quick and dirty way to also match the first and last word in the content.
    str = string.Format(" {0} ", str);

    // Regex.Escape keeps any regex metacharacters in a glossary word literal.
    foreach (string word in glossaryWords)
        str = Regex.Replace(str, "([\\W])(" + Regex.Escape(word) + ")([\\W])", "$1<span class='glossaryItem'>$2</span>$3", RegexOptions.IgnoreCase);

    return str.Trim();
}
Stefan
+1  A: 

For anyone interested, here is my final solution. It is for a .NET user control. It uses a single pattern with a match evaluator, as suggested by Tomalak, so there is no foreach loop. It's an elegant solution, and it gives me the correct output for the sample input while preserving correct casing for matched strings.

public partial class Abbreviations : System.Web.UI.UserControl
{
    private Dictionary<String, String> dictionary = DataHelper.GetAbbreviations();

    protected void Page_Load(object sender, EventArgs e)
    {
        string input = "This is just a little test of the memb. And another memb, but not amemba to see if it gets picked up. Deb of course should also be caught here.deb!";

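        // Escape each key so abbreviations containing regex metacharacters are matched literally.
        var regex = "\\b(?:" + String.Join("|", dictionary.Keys.Select(k => Regex.Escape(k)).ToArray()) + ")\\b";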

        MatchEvaluator myEvaluator = new MatchEvaluator(GetExplanationMarkup);

        input = Regex.Replace(input, regex, myEvaluator, RegexOptions.IgnoreCase);

        litContent.Text = input;
    }

    private string GetExplanationMarkup(Match m)
    {
        return string.Format("<b title='{0}'>{1}</b>", dictionary[m.Value.ToLower()], m.Value);
    }
}

The output looks like this (below). Note that it only matches full words, and that the casing is preserved from the original string:

This is just a little test of the <b title='Member'>memb</b>. And another <b title='Member'>memb</b>, but not amemba to see if it gets picked up. <b title='Debut'>Deb</b> of course should also be caught here.<b title='Debut'>deb</b>!
David Conlisk