views: 219

answers: 10
I have about 100k Outlook mail items, each with a Body of about 500-600 characters. I have a list of 580 keywords that I must search for in each body, then append the matched words at the bottom.

I believe I've increased the efficiency of most of the function, but it still takes a long time. Even for 100 emails it takes about 4 seconds.

I run the function below twice, once for each keyword list (290 keywords per list).

    public List<string> Keyword_Search(HtmlNode nSearch)
    {
        var wordFound = new List<string>();
        foreach (string currWord in _keywordList)
        {
            bool isMatch = Regex.IsMatch(nSearch.InnerHtml, "\\b" + currWord + "\\b",
                                         RegexOptions.IgnoreCase);
            if (isMatch)
            {
                wordFound.Add(currWord);
            }
        }
        return wordFound;
    }

Is there any way I can increase the efficiency of this function?

The other thing that might be slowing it down is that I use HTML Agility Pack to navigate through some nodes and pull out the body (nSearch.InnerHtml). The _keywordList is a List&lt;string&gt;, not an array.

A: 

Regular expressions can be optimized quite a bit when you just want to match against a fixed set of constant strings. Instead of several matches, e.g. against "winter", "win" or "wombat", you can just match against "w(in(ter)?|ombat)", for example (Jeffrey Friedl's book can give you lots of ideas like this). This kind of optimisation is also built into some programs, notably emacs ('regexp-opt'). I'm not too familiar with .NET, but I assume someone has programmed similar functionality - google for "regexp optimization".
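In .NET terms, the same idea can be sketched by escaping each literal and joining the list into one compiled alternation (this leaves the common-prefix factoring to the regex engine's own optimizer rather than building a trie by hand; the helper name is hypothetical):

    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    static class KeywordRegex
    {
        // Fold a fixed keyword list into a single alternation, compiled once.
        // Regex.Escape guards against keywords that contain metacharacters.
        public static Regex Build(IEnumerable<string> keywords)
        {
            string alternation = string.Join("|", keywords.Select(Regex.Escape));
            return new Regex(@"\b(?:" + alternation + @")\b",
                             RegexOptions.IgnoreCase | RegexOptions.Compiled);
        }
    }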

Kilian Foth
Why should this expression be more efficient than `winter|wombat|win`? They both should get compiled down to approximately the same automaton (minus the capture groups which actually make your expression more complex). I’m not convinced and unfortunately I’m unable to find good information on the technical details. Do you have any good source?
Konrad Rudolph
It's not *provably* better than `winter|wombat|win`, it just saves you from having to trust the regexp compiler to do a good job. It *is* better than running N separate matches, which was what the OP had.
Kilian Foth
+2  A: 

I don't think this is a job for regular expressions. You might be better off searching each message word by word and checking each word against your word list. With the approach you have, you're searching each message n times where n is the number of words you want to find - it's no wonder that it takes a while.
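A minimal sketch of this word-by-word approach, assuming the keywords are single words with no embedded punctuation (`_keywordList` and `nSearch` are the names from the question):

    using System;
    using System.Collections.Generic;

    // Split the body once, then do an O(1) set lookup per token instead of
    // scanning the whole body once per keyword.
    var keywords = new HashSet<string>(_keywordList, StringComparer.OrdinalIgnoreCase);
    var wordFound = new List<string>();
    char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':' };
    foreach (string token in nSearch.InnerHtml.Split(separators, StringSplitOptions.RemoveEmptyEntries))
    {
        if (keywords.Contains(token) && !wordFound.Contains(token))
            wordFound.Add(token);
    }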

Jeff Yates
Yes, that's all I'm trying to do. I assumed it was the most efficient/accurate way.
cam
+1  A: 

One thing you can easily do is match against all the words in one go by building an expression like:

\b(?:word1|word2|word3|....)\b

Then you can precompile the pattern and reuse it to look up all occurrences for each email (in .NET you'd construct a single `Regex` instance up front, optionally with `RegexOptions.Compiled`).

Another thing: instead of using the ignore-case flag, converting everything to lowercase might give you a small speed boost (you'd need to profile it, as this is implementation-dependent). Don't forget to warm up the CLR when you profile.

ddimitrov
If combining the words into one regex, they'll need to use groups for each word otherwise they won't be able to track what matched.
Jeff Yates
@Jeff - That's what I thought as well. But I just realized that when using word boundaries the matches will always be the words themselves. So in fact you can simply add the value of each match to the wordFound List. See my answer for my implementation.
Steve Wortham
@Steve: Good point. Seems obvious now you point it out. Cheers.
Jeff Yates
+2  A: 

Most of the time comes from matches that fail, so you want to minimize failures.

If the search keywords are not frequent, you can test for all of them at the same time (with the regex \b(aaa|bbb|ccc|....)\b), then exclude the emails with no matches. For the ones that have at least one match, do a thorough search.
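This two-phase idea might be sketched like so (the combined pattern is built once up front; `Keyword_Search` is the per-keyword function from the question):

    using System.Text.RegularExpressions;

    // Phase 1: one cheap combined test discards emails with no keywords at all.
    // Phase 2: the thorough per-keyword search runs only on the survivors.
    Regex prefilter = new Regex(@"\b(?:" + string.Join("|", _keywordList) + @")\b",
                                RegexOptions.IgnoreCase | RegexOptions.Compiled);

    if (prefilter.IsMatch(nSearch.InnerHtml))
    {
        List<string> wordsFound = Keyword_Search(nSearch);
        // ... append wordsFound to the email body ...
    }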

Dan Andreatta
+7  A: 

I assume that the COM call nSearch.InnerHtml is pretty slow and you repeat the call for every single word that you are checking. You can simply cache the result of the call:

public List<string> Keyword_Search(HtmlNode nSearch)
{
    var wordFound = new List<string>();

    // cache inner HTML
    string innerHtml = nSearch.InnerHtml;

    foreach (string currWord in _keywordList)
    {
        bool isMatch = Regex.IsMatch(innerHtml, "\\b" + @currWord + "\\b",
                                              RegexOptions.IgnoreCase);
        if (isMatch)
        {
            wordFound.Add(currWord);
        }
    }
    return wordFound;
}

Another optimization would be the one suggested by Jeff Yates. E.g. by using a single pattern:

string pattern = @"(\b(?:" + string.Join("|", _keywordList) + @")\b)";
0xA3
Simplified pattern: `@"(\b(?:" + string.Join("|", _keywordList) + @")\b)"`
Konrad Rudolph
@Konrad Rudolph: Oh yes, sure, much nicer that way.
0xA3
A: 

If the regular expression is indeed the bottleneck, and even optimizing it (by concatenating the search words into one expression) doesn’t help, consider using a multi-pattern search algorithm, such as Wu-Manber.

I’ve posted a very simple implementation here on Stack Overflow. It’s written in C++ but since the code is straightforward it should be easy to translate it to C#.

Notice that this will find words anywhere, not just at word boundaries. However, this can be easily tested after you’ve checked whether the text contains any words; either once again with a regular expression (now you only test individual emails – much faster) or manually by checking the characters before and after the individual hits.
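The manual boundary check could look something like this sketch: a substring hit at `index` of a given `length` counts as a whole word only if its neighbours are non-word characters.

    // Returns true if the match at [index, index + length) is bounded by
    // non-word characters (or by the start/end of the text).
    static bool IsWholeWordHit(string text, int index, int length)
    {
        bool okBefore = index == 0 || !char.IsLetterOrDigit(text[index - 1]);
        int end = index + length;
        bool okAfter = end == text.Length || !char.IsLetterOrDigit(text[end]);
        return okBefore && okAfter;
    }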

Konrad Rudolph
A: 

If your problem is about searching for Outlook items containing certain strings, you should get a gain from using Outlook's search facilities...

see: http://msdn.microsoft.com/en-us/library/bb644806.aspx

Adrian
A: 

If your keywords are straight literals, i.e. they contain no regex metacharacters, then another method may be more appropriate. The following code demonstrates one such method; it only goes through each email once, whereas your code goes through each email 290 times (twice, once per keyword list).

        public List<string> FindKeywords(string emailbody, List<string> keywordList)
        {
            // may want to clean up the input a bit, such as replacing '.' and ','
            // with a space and removing double spaces

            List<string> emailBodyAsList = new List<string>(emailbody.Split(' '));

            // use a case-insensitive comparer so neither side has to be upper-cased
            List<string> foundKeywords = new List<string>(
                emailBodyAsList.Intersect(keywordList, StringComparer.OrdinalIgnoreCase));

            return foundKeywords;
        }
Adrian
A: 

If you can use .Net 3.5+ and LINQ you could do something like this.

public static class HtmlNodeTools
{
    public static IEnumerable<string> MatchedKeywords(
        this HtmlNode nSearch,
             IEnumerable<string> keywordList)
    {
        //// as regex
        //var innerHtml = nSearch.InnerHtml;
        //return keywordList.Where(kw =>
        //       Regex.IsMatch(innerHtml, 
        //                     @"\b" + kw + @"\b",
        //                     RegexOptions.IgnoreCase)
        //        );

        // would be faster if you don't need the regex pattern matching;
        // the body is padded with spaces so the padded-keyword check also
        // works at the very start/end (note: String.Contains is
        // case-sensitive, unlike the regex version above)
        var innerHtml = ' ' + nSearch.InnerHtml + ' ';
        return keywordList.Where(kw => innerHtml.Contains(' ' + kw + ' '));
    }
}
class Program
{
    static void Main(string[] args)
    {
        var keyworkList = new string[] { "hello", "world", "nomatch" };
        var h = new HtmlNode()
        {
            InnerHtml = "hi there hello other world"
        };

        var matched = h.MatchedKeywords(keyworkList).ToList();
        //hello, world
    }
}

... reused regex example ...

public static class HtmlNodeTools
{
    public static IEnumerable<string> MatchedKeywords(
        this HtmlNode nSearch,
             IEnumerable<KeyValuePair<string, Regex>> keywordList)
    {
        // as regex
        var innerHtml = nSearch.InnerHtml;
        return from kvp in keywordList
               where kvp.Value.IsMatch(innerHtml)
               select kvp.Key;
    }
}
class Program
{
    static void Main(string[] args)
    {
        var keyworkList = new string[] { "hello", "world", "nomatch" };
        var h = new HtmlNode()
        {
            InnerHtml = "hi there hello other world"
        };

        var keyworkSet = keyworkList.Select(kw => 
            new KeyValuePair<string, Regex>(kw, 
                                            new Regex(
                                                @"\b" + kw + @"\b", 
                                                RegexOptions.IgnoreCase)
                                                )
                                            ).ToArray();

        var matched = h.MatchedKeywords(keyworkSet).ToList();
        //hello, world
    }
}
Matthew Whited
The nice thing about this approach is that if you really only want to know whether a message contains any value in the keyworkList, you can replace the `.ToList()` with `.Any()` and it will return after the first match.
Matthew Whited
You may also consider passing IEnumerable<Regex> into the extension method with the keywords already converted to regular expressions. Then you wouldn't keep regenerating the values for each email you scan.
Matthew Whited
+1  A: 

This may be faster. You can leverage Regex Groups like this:

    public List<string> Keyword_Search(HtmlNode nSearch)
    {
        var wordFound = new List<string>();

        // cache inner HTML
        string innerHtml = nSearch.InnerHtml;
        string pattern = "(\\b" + string.Join("\\b)|(\\b", _keywordList) + "\\b)";
        Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
        MatchCollection myMatches = myRegex.Matches(innerHtml);

        foreach (Match myMatch in myMatches)
        {
            // Group 0 represents the entire match so we skip that one
            for (int i = 1; i < myMatch.Groups.Count; i++)
            {
                if (myMatch.Groups[i].Success)
                    wordFound.Add(_keywordList[i-1]);
            }
        }

        return wordFound;
    }    

This way you're only using one regular expression. And the indices of the Groups should correlate with your _keywordList by an offset of 1, hence the line wordFound.Add(_keywordList[i-1]);

UPDATE:

After looking at my code again I just realized that putting the matches into groups is really unnecessary, and regex groups have some overhead. Instead, you could remove the parentheses from the pattern and simply add the matches themselves to the wordFound list. This produces the same result, but faster.

It'd be something like this:

public List<string> Keyword_Search(HtmlNode nSearch)
{
    var wordFound = new List<string>();

    // cache inner HTML
    string innerHtml = nSearch.InnerHtml;
    string pattern = "\\b(?:" + string.Join("|", _keywordList) + ")\\b";
    Regex myRegex = new Regex(pattern, RegexOptions.IgnoreCase);
    MatchCollection myMatches = myRegex.Matches(innerHtml);

    foreach (Match myMatch in myMatches)
    {
        wordFound.Add(myMatch.Value);
    }

    return wordFound;
}    
Steve Wortham
Thanks Steve, that's what I meant. If you don't mind, could you also let us know whether there is any performance difference between using the IgnoreCase flag and converting the regex and text to lowercase in advance, then matching without it? (When you spin the measuring loop you should take the regex creation out, but keep the innerHtml.ToLower() inside.)
ddimitrov