views:

417

answers:

4

I'm working on a search function for an MVC C# app that will place a (possibly large) chunk of text through a filter, and given the search query, will place an html <span> with a highlighted style before and after each search term.

I've got a simple algorithm working, but I've got a feeling it will be slow, probably because of the amount of strings that will need to be created (2 * the number of matches).

public static string Surround(string original, string head, string tail, string match, StringComparison comparer)
{
 var ret = original;

 if (ret.IndexOf(match, 0, comparer) != -1)
 {
  var lastIndex = 0;

  while ((lastIndex = ret.IndexOf(match, lastIndex, comparer)) != -1)
  {
   ret = ret.Insert(lastIndex, head);
   var tailIndex = lastIndex + match.Length + head.Length;
   lastIndex = tailIndex;
   ret = ret.Insert(tailIndex, tail);
  }
 }

 return ret;
}

I'm wondering if anyone can give some hints for a better algorithm that would perform better for large chunks of text? I was thinking of using a stringbuilder, but it's also occurred to me that I could be approaching this from entirely the wrong way. Any insight would be greatly appreciated.

+1  A: 

A StringBuilder will of course be much faster, but never the less when you do an .Insert you will be moving around a whole lot of data each time. So it would be better to build up the result in the StringBuilder using only .Append. Not .Insert.

However, you should also be able to use a RegularExpression for this, although I don't know the syntax by heart (I use a wonderful tool called RegEx Buddy to build my regular expressions when I have the need.).

EDIT: Very few people in this world have the ability to distinguish a regular expression from tranismission line noise. :-)

danbystrom
I was thinking about a regular expression, but I would need to do some benchmarks before I implemented it. For a large body of text, I think it might be even slower than using all the inserts.
Alex Fort
Agreed. You must benchmark it. Please tell us what you find! :-)
danbystrom
I'm pretty shocked by my benchmark results so far. I'm showing 324ms to match/insert 4000 matches with the string insert method, and only 4ms for the same number of matches with a regex. Let me check to make sure everything is measuring properly again, but it looks promising.
Alex Fort
For completeness, I'd be nice to know what an append-only version with StringBuilder would take!
danbystrom
+6  A: 

A regular expression will do the job, and the code should be a lot simpler. However you'd need to test to determine if it actually delivers better performance. Something like this:

public static string Surround(
    string original, string head, string tail, string match)
{
    return Regex.Replace(
        original, match, head + "$0" + tail, RegexOptions.IgnoreCase);
}

Even better if you can pass in the replacer whole as you save 2N string concats:

public static string Surround(string original, string replacer, string match)
{
    return Regex.Replace(original, match, replacer, RegexOptions.IgnoreCase);
}

Surround("foo bar baz", "<span>$&</span>", "bar");  //call like so
LukeH
I'm not sure if the regex really needs to be so complex, but yeah this is the right approach
annakata
@annakata, Yeah, just noticed that and fixed it before I saw your comment :)
LukeH
The only reason I'm not using Replace, is it needs to be case-insensitive. IE, if I wanted to place a "|" character before and after a match, it would need to display "|test|", as well as "|TeSt|" for the query "test."
Alex Fort
+1...as this was my first thought as to how to solve the problem :D
ee
No worries, I hope you don't mind I added the next logical iteration, at which point of course you basically have just a wrapper on Replace
annakata
i am pretty sure that you can have the query case insensitive and preserve the case in the output...just some variations on the regex expression/settings
ee
@Alex, @ee, RegexOptions.IgnoreCase
LukeH
I've checked all my benchmarks, and for 40,000 matches, my string insert method takes 46,300ms (!) and the regex method you posted only takes 33ms (!!). Thanks for all your help guys, this is a pretty great result for me. :)
Alex Fort
A: 

Luke's answer is good, but you might want to escape any special characters in "match" first in case someone uses a .\^$? etc in their search (unless you want to allow that of course).

ETA: Allowing special characters would allow for a more powerful search, but would result in unexpected output as well, since the found text would be replaced with the pattern rather than the actual matching text.

Eric Petroelje
@Eric, I was originally going to put that in the example, but I thought it would clutter it. I should've mentioned it in the comments though.
LukeH
A: 

The RegEx solution is definitely more elegant than what I've produced. However, using StringBuilder is generally a little faster and doesn't need the search term and the pre-/post-fixes to be regex-escaped.

private static string Surround(string original, string head, string tail, string match, StringComparison comparisonType)
{
  if (string.IsNullOrEmpty(original) || string.IsNullOrEmpty(match) || (string.IsNullOrEmpty(head) && string.IsNullOrEmpty(tail)))
    return original;

  var resultBuilder = new StringBuilder();
  int matchLength = match.Length;
  int lastIdx = 0;

  for (;;)
  {
    int curIdx = original.IndexOf(match, lastIdx, comparisonType);

    if (curIdx > -1)
      resultBuilder
        .Append(original, lastIdx, curIdx - lastIdx)
        .Append(head)
        .Append(original, curIdx, matchLength)
        .Append(tail);
    else
      return resultBuilder.Append(original.Substring(lastIdx)).ToString();

    lastIdx = curIdx + matchLength;
  }
}

I hope you can use it.

Update Doing a little perf test it seems that my solution is only faster if you are searching for "short" words. If the word is long (ie. length > 7) Regex wins, but if the word is short (ie. length < 7) then my solution wins. Anyone know why this is? What operation is so sensitive to length? Is it IndexOf(string, int, int) or maybe Append(string, int, int)?

JohannesH
One optimization that would probably help is to initialize resultBuilder with a more generous initial capacity (such as new StringBuilder(2 * original.Length).
C. Dragon 76