ansaurus

Question

Answer 1

+5 A:

You could use a trie to hold the list of strings; tries were designed for fast re*trie*val. Here's one example of implementing a trie in C#.

Update: Powerpoint presentation on folded tries for Unicode and Ifo on implementation of a folded trie for Unicode (not C#)

Vinay Sajip 2009-07-20 12:05:38

A trie would be great if the strings were just A-Z, or even just ASCII. But these are unicode.

Matt Howells 2009-07-20 12:59:58

From the Wikipedia article I linked to: "Though it is most common, tries need not be keyed by character strings. The same algorithms can easily be adapted to serve similar functions of ordered lists of any construct, e.g., permutations on a list of digits, permutations on a list of shapes, etc." So you could do this with e.g. code points from a Unicode string.

Vinay Sajip 2009-07-20 13:16:35

Got a link to a unicode implementation? Yes, I could use `GetBytes` and switch on the individual bytes, but I suspect that will not perform well.

Matt Howells 2009-07-20 13:27:36

Here's another one - http://paste.lisp.org/display/12161 - the C# char type is a 16-bit Unicode character, do you need more?

Vinay Sajip 2009-07-20 13:32:01

@Vinay, if you read the code you just linked you will see that it only supports a-z, A-Z and 0-9 characters, not the whole range of unicode characters.

Matt Howells 2009-07-20 13:59:49

@Matt, sorry about that, I was a bit hasty. Can't you generalise from those two examples?

Vinay Sajip 2009-07-20 14:01:24

@Vinay The generalization to unicode is non-trivial, as a naive implementation would have 65,536 nodes for each level of the trie. To do this you either need to break the UTF-16 into two bytes, or create a compact Trie / Radix trie / Patricia. If you can find a good .Net implementation of such a beast that would be an acceptable answer, but powerpoint presentations or academic papers with C code are not that useful for me. Thanks for your input though :)

Matt Howells 2009-07-20 15:44:51

Answer 2

+2 A:

Have you considered using the HashSet class (in .NET 3) instead?

Dan Diplo 2009-07-20 12:07:00

... which will again call .GetHashCode and .Equals on every incoming string.

Lasse V. Karlsen 2009-07-20 12:23:51

You can construct a HashSet with your chosen comparer using an overload: HashSet(T) Constructor (IEqualityComparer(T))http://msdn.microsoft.com/en-us/library/bb359100.aspx

Dan Diplo 2009-07-20 13:14:52

Answer 3

+3 A:

Check out these:

Jomo Fisher - Fast Switching with LINQ

Gustavo Guerra - StaticStringDictionary - Fast Switching with LINQ revisited

Jonathan 2009-07-20 12:10:05

This is very interesting and sort of what I had in mind. +1

Matt Howells 2009-07-20 13:12:15

Answer 4

+1 A:

Re your "when the list is small" concern; if you don't mind using non-generic collections, System.Collections.Specialized.HybridDictionary does something like this; it encapsulates a System.Collections.Specialized.ListDictionary when small, or a System.Collections.Hashtable when it gets bigger (>10). Worth a look?

Otherwise; you could perhaps use HashSet<T> with a custom comparer? Then you can choose how expensive GetHashCode() is...

using System;
using System.Collections.Generic;

class CustomStringComparer : IEqualityComparer<string> {
    public bool Equals(string x, string y) {
        return string.Equals(x, y);
    }
    public int GetHashCode(string s) {
        return string.IsNullOrEmpty(s) ? 0 :
            s.Length + 273133 * (int)s[0];
    }
    private CustomStringComparer() { }
    public static readonly CustomStringComparer Default
        = new CustomStringComparer();
}
static class Program {
    static void Main() {
        HashSet<string> set = new HashSet<string>(
            new string[] { "abc", "def", "ghi" }, CustomStringComparer.Default);
        Console.WriteLine(set.Contains("abc"));
        Console.WriteLine(set.Contains("abcde"));
    }
}

Marc Gravell 2009-07-20 12:33:11

It's a good idea, but on further reflection choosing the right hash function when you don't know how many strings will be in the list is very tricky. If its as simple as the function you wrote above, there will be many collisions with larger lists.

Matt Howells 2009-07-20 19:44:34

Answer 5

+1 A:

Perhaps the HybridDictionary is a better option here. Its internal use is dependent on how many items are in the collection.

Pat 2009-07-20 12:36:45

Answer 6

A:

As an aside, if memory serves, when a String is constructed, its HashValue is precomputed and stored with the String as an optimization for this type of use case. If you're using a character array or StringBuilder, this obviously doesn't apply, but for an immutable String it should.

EDIT: I am incorrect... Java does cache a String's HashCode, C# does not.

CoderTao 2009-07-20 12:58:20

I think in this case that memory does not serve. I can see no evidence of hashcode caching when looking at `System.String` with Reflector.

Matt Howells 2009-07-20 13:09:49

You are indeed correct. Java does do this, and I thought that C# would have ported the practice.

CoderTao 2009-07-20 14:52:01

Answer 7

+1 A:

I ended up doing this:

private static bool Contains(List<string> list, string value)
{
    bool contains = null != list.Find(str => str.ToLower().Equals(value.ToLower()));

    return contains;
}

I'm guessing you could create an extension method for List<string>, but this was sufficient for my needs.

hectorsosajr 2009-07-20 13:15:05

I don't think this will perform fast enough for my needs ;)

Matt Howells 2009-07-20 13:29:39

Answer 8

A:

You could use string interning to do this very quickly. When building the list, you have to store your required string's interned format (the result of string.Intern()). Then, you need to compare against an interned string with object.ReferenceEquals - since interned strings have the same reference.

List<string> BuildList() {
    List<string> result;
    foreach (string str from StringSource())
         result.Add(str.Intern());
    return result;
}

bool CheckList(List<string> list, string stringToFind) { // list must be interned for this to work!
    return list.Find(str => object.ReferenceEquals(str, stringToFind)) != null;
}

This will result in a four-byte comparison for each list, and one pass over the original string. The intern pool of strings is built specifically for quick string comparison and finding if one already exists, so the intern operation should be quite fast.

configurator 2009-07-20 15:49:52

Unfortunately `String.Intern` is really not that fast, and would have the undesirable side effect of permanently storing the string until my process ran out of memory (this application processes a lot of strings). Furthermore, subsequently searching the list using ReferenceEquals would be an O(N) operation.

Matt Howells 2009-07-20 19:27:14

It's faster than normal string comparison, but yes, this wouldn't be good for processing a lot of strings.

configurator 2009-07-20 19:59:27

ansaurus

tags:

views:

answers:

Fast string comparison with list

related questions