views:

704

answers:

7

Assuming I do not want to use external libraries or more than a dozen or so extra lines of code (i.e. clear code, not code golf code), can I do better than string.Contains to handle a collection of input strings and a collection of keywords to check for?

Obviously one can use objString.Contains(objString2) to do a simple substring check. However, there are many well-known algorithms which are able to do better than this under special circumstances, particularly if one is working with multiple strings. But sticking such an algorithm into my code would probably add length and complexity, so I'd rather use some sort of shortcut based on a built in function.

E.g. an input would be a collection of strings, a collection of positive keywords, and a collection of negative keywords. Output would be the subset of the first collection (the strings), each of which contains at least 1 positive keyword but 0 negative keywords.

Oh, and please don't mention regular expressions as a suggested solution.

It may be that my requirements are mutually exclusive (not much extra code, no external libraries or regex, better than String.Contains), but I thought I'd ask.

Edit:

A lot of people are only offering silly improvements that won't beat an intelligently used call to contains by much, if anything. Some people are trying to call Contains more intelligently, which completely misses the point of my question. So here's an example of a problem to try solving. LBushkin's solution is an example of someone offering a solution that probably is asymptotically better than standard contains:

Suppose you have 10,000 positive keywords of length 5-15 characters, 0 negative keywords (this seems to confuse people), and one 1,000,000-character string. Check whether the 1,000,000-character string contains at least 1 of the positive keywords.

I suppose one solution is to create an FSA. Another is to delimit on spaces and use hashes.
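To make the "delimit on spaces and use hashes" idea concrete, here is a minimal sketch (in Python for brevity, since the idea is language-agnostic; a C# version would use a HashSet&lt;string&gt; the same way). The function name and sample strings are made up for illustration:

```python
# Put the keywords into a hash set once, then each word of the big string
# becomes a single O(1) average-case lookup instead of one Contains()
# scan per keyword.
def contains_any_keyword(text, keywords):
    keyword_set = set(keywords)            # build once
    return any(word in keyword_set         # O(1) average per word
               for word in text.split())

big_string = "lorem ipsum " * 1000 + "needle " + "dolor sit " * 1000
print(contains_any_keyword(big_string, ["needle", "haystack"]))  # True
print(contains_any_keyword(big_string, ["absent"]))              # False
```

This only works when keywords are whole space-delimited words, which is exactly the restriction several answers below point out.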

A: 

Well, there is the Split() method you can call on a string. You could split your input strings into arrays of words using Split() then do a one-to-one check of words with keywords. I have no idea if or under what circumstances this would be faster than using Contains(), however.

DanM
+1  A: 

If you add this extension method:

public static bool ContainsAny(this string testString, IEnumerable<string> keywords)
{
    foreach (var keyword in keywords)
    {
        if (testString.Contains(keyword))
            return true;
    }
    return false;
}

Then this becomes a one line statement:

var results = testStrings.Where(t => !t.ContainsAny(badKeywordCollection)).Where(t => t.ContainsAny(goodKeywordCollection));

This isn't necessarily any faster than doing the Contains checks yourself, except that it does them efficiently: LINQ's deferred, streaming evaluation short-circuits and prevents unnecessary Contains calls. Plus, having the result be a one-liner is nice.

Reed Copsey
He is using .NET 2.0
Stan R.
Note: He also mentions negative keywords which your code does not address
Nathan Koop
@Nathan: I find badKeywordCollection clear enough ;o)
Fredrik Mörk
http://stackoverflow.com/questions/498686/net-is-string-contains-faster-than-string-indexof/498754 <- so IndexOf might be better, and the right overload would give you negative keywords
Yuriy Faktorovich
I already know how to solve this problem using contains, and I can prevent the unneeded calls to Contains myself. This doesn't really answer my question of how to do better than Contains.
Brian
+2  A: 

Your discussion of "negative and positive" keywords is somewhat confusing - and could use some clarification to get more complete answers.

As with all performance related questions - you should first write the simple version and then profile it to determine where the bottlenecks are - these can be unintuitive and hard to predict. Having said that...

One way to optimize the search (if you are always searching for "words" - and not phrases that could contain spaces) would be to build a search index from your string.

The search index could either be a sorted array (for binary search) or a dictionary. A dictionary would likely prove faster - both because dictionaries are hash maps internally with O(1) lookup, and because a dictionary naturally eliminates duplicate values in the search source - thereby reducing the number of comparisons you need to perform.

The general search algorithm is:

For each string you are searching against:

  • Take the string you are searching within and tokenize it into individual words (delimited by whitespace)
  • Populate the tokens into a search index (either a sorted array or dictionary)
  • Search the index for your "negative keywords", if one is found, skip to the next search string
  • Search the index for your "positive keywords"; when one is found, add it to a dictionary (you could also track a count of how often each word appears)

Here's an example using a sorted array and binary search in C# 2.0:

NOTE: You could switch from string[] to List<string> easily enough, I leave that to you.

string[] FindKeyWordOccurence( string[] stringsToSearch,
                               string[] positiveKeywords,
                               string[] negativeKeywords )
{
   Dictionary<string,int> foundKeywords = new Dictionary<string,int>();
   foreach( string searchIn in stringsToSearch )
   {
       // tokenize and sort the input to make searches faster
       string[] tokenizedList = searchIn.Split( ' ' );
       Array.Sort( tokenizedList );

       // if any negative keywords exist, skip to the next search string
       // (a flag is needed here; 'continue' inside the inner loop would
       // only skip to the next negative keyword)
       bool hasNegative = false;
       foreach( string negKeyword in negativeKeywords )
       {
           if( Array.BinarySearch( tokenizedList, negKeyword ) >= 0 )
           {
               hasNegative = true;
               break;
           }
       }
       if( hasNegative )
           continue; // skip to next search string...

       // for each positive keyword found, record it in the dictionary
       // we could have also used a SortedList, but the dictionary is easier
       foreach( string posKeyword in positiveKeywords )
           if( Array.BinarySearch( tokenizedList, posKeyword ) >= 0 )
               foundKeywords[posKeyword] = 1;
   }

   // convert the Keys in the dictionary (our found keywords) to an array...
   string[] foundKeywordsArray = new string[foundKeywords.Keys.Count];
   foundKeywords.Keys.CopyTo( foundKeywordsArray, 0 );
   return foundKeywordsArray;
}

Here's a version that uses a dictionary-based index and LINQ in C# 3.0:

NOTE: This is not the most LINQ-y way to do it, I could use Union() and SelectMany() to write the entire algorithm as a single big LINQ statement - but I find this to be easier to understand.

public IEnumerable<string> FindOccurences( IEnumerable<string> searchStrings,
                                           IEnumerable<string> positiveKeywords,
                                           IEnumerable<string> negativeKeywords )
{
    var foundKeywordsDict = new Dictionary<string, int>();
    foreach( var searchIn in searchStrings )
    {
        // tokenize the search string; Distinct() is required because
        // ToDictionary() throws on duplicate keys
        var tokenizedDictionary =
            searchIn.Split( ' ' ).Distinct().ToDictionary( x => x );
        // skip if any negative keywords exist...
        if( negativeKeywords.Any( tokenizedDictionary.ContainsKey ) )
            continue;
        // merge found positive keywords into dictionary...
        // an example of where Enumerable.ForEach() would be nice...
        var found = positiveKeywords.Where( tokenizedDictionary.ContainsKey );
        foreach( var keyword in found )
            foundKeywordsDict[keyword] = 1;
    }
    return foundKeywordsDict.Keys;
}
LBushkin
This seems more complicated (and no faster) than just looping through with string.Contains. Where's the benefit here over doing multiple Contains() calls?
Reed Copsey
Ignoring negative and positive keywords will probably still yield mostly the same answers as what I'll end up with if I use both anyhow, so feel free to ignore that part of the discussion.
Brian
It's the same reason why full-text indexes are faster than Boyer-Moore string scans. BinarySearch() on a sorted array is O(log N) whereas string.Contains() is O(N). If the search string is long enough and you have enough tokens to find, this can add up. On short strings or with few keywords, it may actually take longer.
LBushkin
@LBushkin: Using hashing (a dictionary, since it's .net 2.0) would probably be faster than using Binary search.
Brian
Possibly - I've added a variation using LINQ that employs this approach.
LBushkin
@Brian - the major benefit of a dictionary over an array is not only that hashing is faster than binary search, but also that a dictionary will (naturally) only contain the unique set of words that appear in the search string - thereby reducing the number of comparisons necessary. We could eliminate duplicates from (make distinct) the sorted array, but that takes extra effort that the dictionary gives us for free.
LBushkin
@LBushkin: If you're going to ignore the fact that I'm using .net 2.0, a hashset makes more sense than a dictionary.
Brian
A: 

First get rid of all the strings that contain negative words. I would suggest doing this using the Contains method. I would think that Contains() is faster than splitting, sorting, and searching.
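As a sketch of that two-pass filter (in Python for brevity; the function name is made up, and a C# version would use the same Contains calls via LINQ or loops):

```python
# Pass 1: drop any string containing a negative word.
# Pass 2: keep only strings containing at least one positive word.
def filter_strings(strings, positives, negatives):
    survivors = [s for s in strings
                 if not any(neg in s for neg in negatives)]
    return [s for s in survivors
            if any(pos in s for pos in positives)]

print(filter_strings(["good cat", "bad dog", "plain"],
                     ["cat", "dog"], ["bad"]))  # ['good cat']
```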

Nick
I already know how to do all this, and I agree that Contains is faster than splitting, sorting, etc. But I do know that there are algorithms that are asymptotically faster than contains in the case that you have multiple keywords. I suspect that technically, regex is one such algorithm...if you ignore the overhead.
Brian
A: 

Seems to me that the best way to do this is to take your match strings (both positive and negative) and compute a hash of them. Then march through your million-character string computing n hashes (in your case, 11 per position for strings of length 5-15) and match against the hashes of your match strings. If you get hash matches, then you do an actual string compare to rule out false positives (this is essentially the Rabin-Karp approach). There are a number of good ways to optimize this by bucketing your match strings by length and creating hashes based on the string size for a particular bucket.

So you get something like:

IList<Buckets> buckets = BuildBuckets(matchStrings);
int shortestLength = buckets[0].Length;
for (int i = 0; i <= inputString.Length - shortestLength; i++) {
    foreach (Bucket b in buckets) {
        // note: '>' rather than '>=', or matches ending at the last
        // character of the input would be skipped
        if (i + b.Length > inputString.Length)
            continue;
        string candidate = inputString.Substring(i, b.Length);
        int hash = ComputeHash(candidate);

        foreach (MatchString match in b.MatchStrings) {
            if (hash != match.Hash)
                continue;
            if (candidate == match.String) {
                if (match.IsPositive) {
                    // positive case
                }
                else {
                    // negative case
                }
            }
        }
    }
}
plinth
+1  A: 

If you're truly just looking for space-delimited words, this code would be a very simple implementation:

    static void Main(string[] args)
    {
        string sIn = "This is a string that isn't nearly as long as it should be " +
            "but should still serve to prove an algorithm";
        string[] sFor = { "string", "as", "not" };
        Console.WriteLine(string.Join(", ", FindAny(sIn, sFor)));
    }

    private static string[] FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Intersect(hsFor).ToArray();
    }

If you only wanted a yes/no answer (as I see now may have been the case), HashSet has another method, Overlaps, that's probably better optimized for that:

    private static bool FindAny(string searchIn, string[] searchFor)
    {
        HashSet<String> hsIn = new HashSet<string>(searchIn.Split());
        HashSet<String> hsFor = new HashSet<string>(searchFor);
        return hsIn.Overlaps(hsFor);
    }
BlueMonkMN
I think this is probably the best answer thus far. It offers everything I asked for. A shame it won't run on .Net 2.0, but it definitely offers a short, easily understood alternative to contains to handle multiple keywords.
Brian
A: 

To optimize Contains(), you need a tree (or trie) structure of your positive/negative words.

That should speed up everything (O(n) with Aho-Corasick-style failure links vs O(nm) for repeated scans; n = size of string, m = avg word size) and the code is relatively small & easy.
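A minimal sketch of the trie idea (in Python for brevity; the helper names are made up). Note this simple version still restarts the trie walk from every position, so hitting the full O(n) bound requires adding failure links as in Aho-Corasick:

```python
def build_trie(words):
    # nested dicts as trie nodes; "$" marks the end of a keyword
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def contains_any(text, trie):
    # try to match some keyword starting at each position i;
    # one trie walk checks all keywords simultaneously
    for i in range(len(text)):
        node = trie
        for ch in text[i:]:
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:          # a keyword ends here
                return True
    return False

trie = build_trie(["cat", "car"])
print(contains_any("the cart rolled", trie))  # True ("car" occurs in "cart")
print(contains_any("dog", trie))              # False
```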

Ray