tags:

views:

985

answers:

8

I need a fast method to determine if a given string is in a list of strings.

The list of strings is not known until runtime but thereafter it will not change.

I could simply have a List<String> called strings and then do:

if (strings.Contains(item))

However this will perform poorly if there are many strings in the list.

I could also use a HashSet<String>, but this would necessitate calling GetHashCode on every incoming string as well as Equals, which would be a waste if there are e.g. only 3 strings in the list. Did I mention this needs to be fast?

I could when setting up, decide to use a List or a HashSet depending on the number of strings (e.g. use List for less than 10 strings, HashSet otherwise), rather like the logic in HybridDictionary.

As the strings are unicode, a standard Trie structure will not work, although a Radix tree/Patricia trie might. Are there any good C# implementations out there with benchmarks?

Some have mentioned bypassing String's GetHashCode and using a faster performing hash function. Are there any benchmarks out there?

Using LINQ expressions to essentially create an optimised switch statement is a novel approach which looks very interesting.

What else would work? Setup cost is not important, just the search speed.

If it matters, the incoming string values will rarely appear in the list.

+5  A: 

You could use a trie to hold the list of strings; tries were designed for fast re*trie*val. Here's one example of implementing a trie in C#.

Update: Powerpoint presentation on folded tries for Unicode and Ifo on implementation of a folded trie for Unicode (not C#)

Vinay Sajip
A trie would be great if the strings were just A-Z, or even just ASCII. But these are unicode.
Matt Howells
From the Wikipedia article I linked to: "Though it is most common, tries need not be keyed by character strings. The same algorithms can easily be adapted to serve similar functions of ordered lists of any construct, e.g., permutations on a list of digits, permutations on a list of shapes, etc." So you could do this with e.g. code points from a Unicode string.
Vinay Sajip
Got a link to a unicode implementation? Yes, I could use `GetBytes` and switch on the individual bytes, but I suspect that will not perform well.
Matt Howells
Here's another one - http://paste.lisp.org/display/12161 - the C# char type is a 16-bit Unicode character, do you need more?
Vinay Sajip
@Vinay, if you read the code you just linked you will see that it only supports a-z, A-Z and 0-9 characters, not the whole range of unicode characters.
Matt Howells
@Matt, sorry about that, I was a bit hasty. Can't you generalise from those two examples?
Vinay Sajip
@Vinay The generalization to unicode is non-trivial, as a naive implementation would have 65,536 nodes for each level of the trie. To do this you either need to break the UTF-16 into two bytes, or create a compact Trie / Radix trie / Patricia. If you can find a good .Net implementation of such a beast that would be an acceptable answer, but powerpoint presentations or academic papers with C code are not that useful for me. Thanks for your input though :)
Matt Howells
+2  A: 

Have you considered using the HashSet class (in .NET 3) instead?

Dan Diplo
... which will again call .GetHashCode and .Equals on every incoming string.
Lasse V. Karlsen
You can construct a HashSet with your chosen comparer using an overload: HashSet(T) Constructor (IEqualityComparer(T))http://msdn.microsoft.com/en-us/library/bb359100.aspx
Dan Diplo
+3  A: 

Check out these:

Jomo Fisher - Fast Switching with LINQ

Gustavo Guerra - StaticStringDictionary - Fast Switching with LINQ revisited

Jonathan
This is very interesting and sort of what I had in mind. +1
Matt Howells
+1  A: 

Re your "when the list is small" concern; if you don't mind using non-generic collections, System.Collections.Specialized.HybridDictionary does something like this; it encapsulates a System.Collections.Specialized.ListDictionary when small, or a System.Collections.Hashtable when it gets bigger (>10). Worth a look?


Otherwise; you could perhaps use HashSet<T> with a custom comparer? Then you can choose how expensive GetHashCode() is...

using System;
using System.Collections.Generic;

class CustomStringComparer : IEqualityComparer<string> {
    public bool Equals(string x, string y) {
        return string.Equals(x, y);
    }
    public int GetHashCode(string s) {
        return string.IsNullOrEmpty(s) ? 0 :
            s.Length + 273133 * (int)s[0];
    }
    private CustomStringComparer() { }
    public static readonly CustomStringComparer Default
        = new CustomStringComparer();
}
static class Program {
    static void Main() {
        HashSet<string> set = new HashSet<string>(
            new string[] { "abc", "def", "ghi" }, CustomStringComparer.Default);
        Console.WriteLine(set.Contains("abc"));
        Console.WriteLine(set.Contains("abcde"));
    }
}
Marc Gravell
It's a good idea, but on further reflection choosing the right hash function when you don't know how many strings will be in the list is very tricky. If its as simple as the function you wrote above, there will be many collisions with larger lists.
Matt Howells
+1  A: 

Perhaps the HybridDictionary is a better option here. Its internal use is dependent on how many items are in the collection.

Pat
A: 

As an aside, if memory serves, when a String is constructed, its HashValue is precomputed and stored with the String as an optimization for this type of use case. If you're using a character array or StringBuilder, this obviously doesn't apply, but for an immutable String it should.

EDIT: I am incorrect... Java does cache a String's HashCode, C# does not.

CoderTao
I think in this case that memory does not serve. I can see no evidence of hashcode caching when looking at `System.String` with Reflector.
Matt Howells
You are indeed correct. Java does do this, and I thought that C# would have ported the practice.
CoderTao
+1  A: 

I ended up doing this:

private static bool Contains(List<string> list, string value)
{
    bool contains = null != list.Find(str => str.ToLower().Equals(value.ToLower()));

    return contains;
}

I'm guessing you could create an extension method for List<string>, but this was sufficient for my needs.

hectorsosajr
I don't think this will perform fast enough for my needs ;)
Matt Howells
A: 

You could use string interning to do this very quickly. When building the list, you have to store your required string's interned format (the result of string.Intern()). Then, you need to compare against an interned string with object.ReferenceEquals - since interned strings have the same reference.

List<string> BuildList() {
    List<string> result;
    foreach (string str from StringSource())
         result.Add(str.Intern());
    return result;
}

bool CheckList(List<string> list, string stringToFind) { // list must be interned for this to work!
    return list.Find(str => object.ReferenceEquals(str, stringToFind)) != null;
}

This will result in a four-byte comparison for each list, and one pass over the original string. The intern pool of strings is built specifically for quick string comparison and finding if one already exists, so the intern operation should be quite fast.

configurator
Unfortunately `String.Intern` is really not that fast, and would have the undesirable side effect of permanently storing the string until my process ran out of memory (this application processes a lot of strings). Furthermore, subsequently searching the list using ReferenceEquals would be an O(N) operation.
Matt Howells
It's faster than normal string comparison, but yes, this wouldn't be good for processing a lot of strings.
configurator