views:

704

answers:

7

Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?

Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.

E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. Output would be the subset of the first collection (the strings), each of which contains at least 1 positive keyword but 0 negative keywords.

Oh, and please don't mention regular expressions as a suggested solution.

It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.

Edit:

A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:

Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and one 1,000,000-character string. Check whether the 1,000,000-character string contains at least 1 of the positive keywords.

I suppose one solution is to create an FSA. Another is to delimit on spaces and use hashes.
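To make the "delimit on spaces and use hashes" idea concrete, here is a minimal sketch (in Python for brevity, since the idea is language-agnostic; a C# version would use a HashSet&lt;string&gt; the same way). The function name and sample strings are made up for illustration:

```python
# Put the keywords into a hash set once, then each word of the big string
# becomes a single O(1) average-case lookup instead of one Contains()
# scan per keyword.
def contains_any_keyword(text, keywords):
    keyword_set = set(keywords)            # build once
    return any(word in keyword_set         # O(1) average per word
               for word in text.split())

big_string = "lorem ipsum " * 1000 + "needle " + "dolor sit " * 1000
print(contains_any_keyword(big_string, ["needle", "haystack"]))  # True
print(contains_any_keyword(big_string, ["absent"]))              # False
```

This only works when keywords are whole space-delimited words, which is exactly the restriction several answers below point out.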

A: 

Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.

DanM
+1  A: 

If you add this extension method:

public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
    foreach (var keyword in keywords)
    {
        if (testString.Contains(keyword))
            return true;
    }
    return false;
}

Then this becomes a one line statement:

var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));

This isn't necessarily any faster than doing the Contains checks yourself, except that it does them efficiently: LINQ's deferred, streaming evaluation short-circuits and prevents unnecessary Contains calls. Plus, having the result be a one-liner is nice.

Reed Copsey
He is using .NET 2.0
Stan R.
Note: He also mentions negative keywords which your code does not address
Nathan Koop
@Nathan: I find badKeywordCollection clear enough ;o)
Fredrik Mörk
http://stackoverflow.com/questions/498686/net-is-string-contains-faster-than-string-indexof/498754 <- so IndexOf might be better, and the right overload would give you negative keywords
Yuriy Faktorovich
I already know how to solve this problem using contains, and I can prevent the unneeded calls to Contains myself. This doesn't really answer my question of how to do better than Contains.
Brian
+2  A: 

Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.

As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...

One way to optimize the search (if you are always searching for "words" - and not phrases that could contain spaces) would be to build a search index from your string.

The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster - both because dictionaries are hash maps internally with O(1) lookup, and because a dictionary naturally eliminates duplicate values in the search source - thereby reducing the number of comparisons you need to perform.

The general search algorithm is:

For each string you are searching against:

  • Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
  • Populate the tokens into a search index (either a sorted array or dictionary)
  • Search the index for your "negative keywords", if one is found, skip to the next search string
  • Search the index for your "positive keywords"; when one is found, add it to a dictionary (you could also track a count of how often each word appears)

Here's an example using a sorted array and binary search in C# 2.0:

NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.

string[] FindKeyWordOccurence( string[] stringsToSearch,
                               string[] positiveKeywords,
                               string[] negativeKeywords )
{
   Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
   foreach( string searchIn in stringsToSearch )
   {
       // tokenize and sort the input to make searches faster
       string[] tokenizedList = searchIn.Split( ' ' );
       Array.Sort( tokenizedList );

       // if any negative keywords exist, skip to the next search string
       // (a flag is needed here; 'continue' inside the inner loop would
       // only skip to the next negative keyword)
       bool hasNegative = false;
       foreach( string negKeyword in negativeKeywords )
       {
           if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
           {
               hasNegative = true;
               break;
           }
       }
       if( hasNegative )
           continue; // skip to next search string...

       // for each positive keyword found, record it in the dictionary
       // we could have also used a SortedList, but the dictionary is easier
       foreach( string posKeyword in positiveKeywords )
           if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
               foundKeywords[posKeyword] = 1;
   }

   // convert the Keys in the dictionary (our found keywords) to an array...
   string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
   foundKeywords.Keys.CopyTo( foundKeywordsArray, 0 );
   return foundKeywordsArray;
}

Here's a version that uses a dictionary-based index and LINQ in C# 3.0:

NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.

public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
                                           IEnumerable<string> positiveKeywords,
                                           IEnumerable<string> negativeKeywords )
{
    var foundKeywordsDict = new Dictionary<string, int>();
    foreach( var searchIn in searchStrings )
    {
        // tokenize the search string; Distinct() is required because
        // ToDictionary() throws on duplicate keys
        var tokenizedDictionary =
            searchIn.Split( ' ' ).Distinct().ToDictionary( x => x );
        // skip if any negative keywords exist...
        if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
            continue;
        // merge found positive keywords into dictionary...
        // an example of where Enumerable.ForEach() would be nice...
        var found = positiveKeywords.Where( tokenizedDictionary.ContainsKey );
        foreach( var keyword in found )
            foundKeywordsDict[keyword] = 1;
    }
    return foundKeywordsDict.Keys;
}
LBushkin
This seems more complicated (and no faster) than just looping through with string.Contains. Where's the benefit here over doing multiple Contains() calls?
Reed Copsey
Ignoring negative and positive keywords will probably still yield mostly the same answers as what I'll end up with if I use both anyhow, so feel free to ignore that part of the discussion.
Brian
It's the same reason why full-text indexes are faster than Boyer-Moore string scans. BinarySearch() on a sorted array is O(log N) whereas string.Contains() is O(N). If the search string is long enough and you have enough tokens to find, this can add up. On short strings or with few keywords, it may actually take longer.
LBushkin
@LBushkin: Using hashing (a dictionary, since it's .net 2.0) would probably be faster than using Binary search.
Brian
Possibly - I've added a variation using LINQ that employs this approach.
LBushkin
@Brian - the major benefit of a dictionary over an array is not only that hashing is faster than binary search, but also that a dictionary will (naturally) only contain the unique set of words that appear in the search string - thereby reducing the number of comparisons necessary. We could eliminate duplicates from (make distinct) the sorted array, but that takes extra effort that the dictionary gives us for free.
LBushkin
@LBushkin: If you're going to ignore the fact that I'm using .net 2.0, a hashset makes more sense than a dictionary.
Brian
A: 

First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster than splitting, sorting, and searching.
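As a sketch of that two-pass filter (in Python for brevity; the function name is made up, and a C# version would use the same Contains calls via LINQ or loops):

```python
# Pass 1: drop any string containing a negative word.
# Pass 2: keep only strings containing at least one positive word.
def filter_strings(strings, positives, negatives):
    survivors = [s for s in strings
                 if not any(neg in s for neg in negatives)]
    return [s for s in survivors
            if any(pos in s for pos in positives)]

print(filter_strings(["good cat", "bad dog", "plain"],
                     ["cat", "dog"], ["bad"]))  # ['good cat']
```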

Nick
I already know how to do all this, and I agree that Contains is faster than splitting, sorting, etc. But I do know that there are algorithms that are asymptotically faster than contains in the case that you have multiple keywords. I suspect that technically, regex is one such algorithm...if you ignore the overhead.
Brian
A: 

Seems to me that the best way to do this is to take your match strings (both positive and negative) and compute a hash of them. Then march through your million-character string computing n hashes (in your case, 11 per position for strings of length 5-15) and match against the hashes of your match strings. If you get hash matches, then you do an actual string compare to rule out false positives (this is essentially the Rabin-Karp approach). There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.

So you get something like:

IList<Buckets> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i <= inputString.Length - shortestLength; i++) {
    foreach (Bucket b in buckets) {
        // note: '>' rather than '>=', or matches ending at the last
        // character of the input would be skipped
        if (i + b.Length > inputString.Length)
            continue;
        string candidate = inputString.Substring(i, b.Length);
        int hash = ComputeHash(candidate);

        foreach (MatchString match in b.MatchStrings) {
            if (hash != match.Hash)
                continue;
            if (candidate == match.String) {
                if (match.IsPositive) {
                    // positive case
                }
                else {
                    // negative case
                }
            }
        }
    }
}
plinth
+1  A: 

If you're truly just looking for space-delimited words, this code would be a very simple implementation:

    static void Main(string[] args)
    {
        string sIn = "This is a string that isn't nearly as long as it should be " +
            "but should still serve to prove an algorithm";
        string[] sFor = { "string", "as", "not" };
        Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
    }

    private static string[] FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Intersect(hsFor).ToArray();
    }

If you only wanted a yes/no answer (as I see now may have been the case), HashSet has another method, Overlaps, that's probably better optimized for that:

    private static bool FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Overlaps(hsFor);
    }
BlueMonkMN
I think this is probably the best answer thus far. It offers everything I asked for. A shame it won't run on .Net 2.0, but it definitely offers a short, easily understood alternative to contains to handle multiple keywords.
Brian
A: 

To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.

That should speed up everything (O(n) with Aho-Corasick-style failure links vs O(nm) for repeated scans; n = size of string, m = avg word size) and the code is relatively small & easy.
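A minimal sketch of the trie idea (in Python for brevity; the helper names are made up). Note this simple version still restarts the trie walk from every position, so hitting the full O(n) bound requires adding failure links as in Aho-Corasick:

```python
def build_trie(words):
    # nested dicts as trie nodes; "$" marks the end of a keyword
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def contains_any(text, trie):
    # try to match some keyword starting at each position i;
    # one trie walk checks all keywords simultaneously
    for i in range(len(text)):
        node = trie
        for ch in text[i:]:
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:          # a keyword ends here
                return True
    return False

trie = build_trie(["cat", "car"])
print(contains_any("the cart rolled", trie))  # True ("car" occurs in "cart")
print(contains_any("dog", trie))              # False
```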

Ray