Hello fellow stackoverflowers!

I have a word list of 200,000 string entries, with an average string length of around 30 characters. These words are the keys, and each key maps to a domain object. I would like to find the domain objects in this collection by knowing only a part of the key. For example, the search string "kov" would match the key "stackoverflow".

Currently I am using a Ternary Search Tree (TST), which usually finds the items within 100 milliseconds. This is, however, too slow for my requirements. The TST implementation could be improved with some minor optimizations, and I could try to balance the tree, but I figured these things would not give me the 5x-10x speed improvement I am aiming for. I am assuming the reason it is so slow is that I basically have to visit most nodes in the tree.

Any ideas on how to improve the speed of the algorithm? Are there any other algorithms that I should be looking at?

Thanks in advance, Oskar

A: 

Would you get any advantage from making your trie keys comparable in size to the machine register? So if you are on a 32-bit box you could compare 4 characters at once instead of each character individually? I don't know how much that would increase the size of your app.
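
A minimal sketch of what that packing could look like, assuming ASCII keys (the Pack4 helper is purely illustrative, not from any existing library):

static class CharPacking {
    // Pack up to four ASCII characters into one 32-bit value so that a single
    // integer comparison replaces four character-by-character comparisons.
    public static uint Pack4(string s, int offset) {
        uint packed = 0;
        for (int i = 0; i < 4; ++i) {
            packed <<= 8;
            if (offset + i < s.Length)
                packed |= (byte)s[offset + i];   // missing characters pad with 0
        }
        return packed;
    }
}

Comparing Pack4(a, 0) with Pack4(b, 0) then handles four characters in one machine-word operation; whether the extra storage is worth it depends on how the trie stores its keys.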

Dan Pieczynski
+7  A: 

Suffix Array and q-gram index

If your strings have a strict upper bound on the size you might consider the use of a suffix array: Simply pad all your strings to the same maximum length using a special character (e.g. the null char). Then concatenate all strings and build a suffix array index over them.

This gives you a lookup runtime of O(m log n), where m is the length of your query string and n is the overall length of your combined strings. If this still isn't good enough, and your m has a fixed, small length, and your alphabet Σ is restricted in size (say, |Σ| < 128 different characters), you can additionally build a q-gram index. This will allow retrieval in constant time. However, the q-gram table requires |Σ|^m entries (= 8 MiB in the case of just 3 characters, and 1 GiB for 4 characters!).
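
To make those figures concrete (assuming 4 bytes per table entry, which is how I read the numbers above):

|Σ|^3 = 128^3 = 2,097,152 entries × 4 bytes = 8 MiB
|Σ|^4 = 128^4 = 268,435,456 entries × 4 bytes = 1 GiB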

Making the index smaller

It might be possible to reduce the size of the q-gram table (exponentially, in the best case) by adjusting the hash function. Instead of assigning a unique number to every possible q-gram you might employ a lossy hash function. The table then would have to store lists of possible suffix array indices instead of just one suffix array entry corresponding to an exact match. This would entail that lookup is no longer constant, though, because all entries in the list would have to be considered.
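
A minimal sketch of such a lossy hash, assuming we fold each character into 16 buckets (the bucket count and the bit masking are my choice, not part of the answer):

static class LossyQGram {
    // Fold each character into 16 buckets so the table only needs 16^q slots
    // instead of |Sigma|^q. Different q-grams can collide, so each slot has to
    // hold a list of candidate suffix array positions which are then verified
    // against the actual query string at lookup time.
    public static int Hash(string s, int offset, int q) {
        int h = 0;
        for (int i = 0; i < q; ++i)
            h = (h << 4) | (s[offset + i] & 0x0F);
        return h;
    }
}

With q = 3 this gives a table of only 16^3 = 4,096 slots; the price is that lookup is no longer strictly constant, because every candidate in a slot has to be checked.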

By the way, I'm not sure if you're familiar with how a q-gram index works, since the Internet isn't very helpful on this topic (I've mentioned this before in another question). I've therefore included a description and an algorithm for its construction in my bachelor thesis.

Proof of concept

I've written a very small C# proof of concept (since you stated elsewhere that you work with C#). It works; however, it is very slow, for two reasons. First, the suffix array creation simply sorts the suffixes. This alone has runtime O(n² log n); there are far superior methods. Worse, however, is the fact that I use Substring to obtain the suffixes. Unfortunately, .NET creates a copy of the whole suffix for this. To use this code in practice, make sure that you use in-place methods which do not copy any data around unnecessarily. The same is true for retrieving the q-grams from the string.

It would possibly be even better not to construct the m_Data string used in my example at all. Instead, you could keep a reference to the original array and simulate all my Substring accesses by working on that array.

Still, it's easy to see that this implementation has essentially expected constant time retrieval (if the dictionary is well-behaved)! This is quite an achievement that can't possibly be beaten by a search tree/trie!

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class QGramIndex {
    private readonly int m_Maxlen;
    private readonly string m_Data;
    private readonly int m_Q;
    private int[] m_SA;
    private Dictionary<string, int> m_Dir = new Dictionary<string, int>();

    private struct StrCmp : IComparer<int> {
        public readonly String Data;
        public StrCmp(string data) { Data = data; }
        public int Compare(int x, int y) {
            return string.CompareOrdinal(Data.Substring(x), Data.Substring(y));
        }
    }

    private readonly StrCmp cmp;

    public QGramIndex(IList<string> strings, int maxlen, int q) {
        m_Maxlen = maxlen;
        m_Q = q;

        var sb = new StringBuilder(strings.Count * maxlen);
        foreach (string str in strings)
            sb.Append(str.PadRight(maxlen, '\u0000'));
        m_Data = sb.ToString();
        cmp = new StrCmp(m_Data);
        MakeSuffixArray();
        MakeIndex();
    }

    public int this[string s] { get { return FindInIndex(s); } }

    private void MakeSuffixArray() {
        // Approx. runtime: n^3 * log n!!!
        // But I claim the shortest ever implementation of a suffix array!
        m_SA = Enumerable.Range(0, m_Data.Length).ToArray();
        Array.Sort(m_SA, cmp);
    }

    private int FindInArray(int ith) {
        return Array.BinarySearch(m_SA, ith, cmp);
    }

    private int FindInIndex(string s) {
        int idx;
        if (!m_Dir.TryGetValue(s, out idx))
            return -1;
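        // Each padded string occupies exactly m_Maxlen characters in m_Data,
        // so integer division recovers the index of the original string.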
        return m_SA[idx] / m_Maxlen;
    }

    private string QGram(int i) {
        return i > m_Data.Length - m_Q ?
            m_Data.Substring(i) :
            m_Data.Substring(i, m_Q);
    }

    private void MakeIndex() {
        for (int i = 0; i < m_Data.Length; ++i) {
            int pos = FindInArray(i);
            if (pos < 0) continue;
            m_Dir[QGram(i)] = pos;
        }
    }
}

Example of usage:

static void Main(string[] args) {
    var strings = new [] { "hello", "world", "this", "is", "a",
                           "funny", "test", "which", "i", "have",
                           "taken", "much", "too", "far", "already" };

    var index = new QGramIndex(strings, 10, 3);

    var tests = new [] { "xyz", "aki", "ake", "muc", "uch", "too", "fun", "est",
                         "hic", "ell", "llo", "his" };

    foreach (var str in tests) {
        int pos = index[str];
        if (pos > -1)
            Console.WriteLine("\"{0}\" found in \"{1}\".", str, strings[pos]);
        else
            Console.WriteLine("\"{0}\" not found.", str);
    }
}
Konrad Rudolph
Is there a way to split up a q-gram table so that you don't thrash the disk using it?
Will
I'm not aware of any way. Your best bet might be reducing the alphabet by hashing several characters to the same key, and thus also reducing the table size exponentially. However, you then need to take care of collisions.
Konrad Rudolph
Why is padding the strings necessary? Is suffix array better than a suffix tree?
Rafał Dowgird
@Rafał: I'm padding the strings so I can easily calculate a string's index from its position in the suffix array. There are other solutions, but these require modifying the suffix array, making construction harder.
Konrad Rudolph
A suffix array is better than a suffix tree because it can be stored much more space-efficiently. More importantly, you need a suffix *array* in order to create the q-gram index efficiently (at least I don't know any algorithm for creating a q-gram index for a suffix tree).
Konrad Rudolph
Good points about the tree. Back to the padding - as I understand, you can get the whole suffix from the table ("kov"->"koverflow"). Finding the original string by the suffix should be fast (or even by the prefix, if you build the table from reversed strings). Correct?
Rafał Dowgird
@Rafał: “Finding the original string by the suffix should be fast” – how? However, I recognize that padding the string is generally not a good way. It would be better to build the suffix array over the array of strings. This is possible, albeit slightly harder. I'll update my text accordingly.
Konrad Rudolph
You can find the string by suffix in O(log(N)) time if you keep an additional table of the strings sorted by their reverse. Or keep the strings sorted naturally and build the suffix array from reversed strings, getting prefixes instead of suffixes.
Rafał Dowgird
@Rafał: Have a look at my follow-up post. However, in response to your log(N) proposition: keep in mind that your N here isn't just 200,000 but instead it's the number of all suffixes, which is much higher.
Konrad Rudolph
A: 

Would it be possible to "hash" the key value? Basically, have a 2nd tree with all the possible values to search for, pointing to a list of keys into the 1st tree.

You're going to need 2 trees: the 1st one maps a hash value to the domain object; the 2nd maps the search strings to hash values, with multiple keys pointing to the same hash value.

Example tree 1: STCKVRFLW -> domain object

tree 2: stack -> STCKVRFLW, STCK
        over -> STCKVRFLW, VRBRD, VR

So searching in the 2nd tree gives you a list of keys to look up in the 1st tree.
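
A rough sketch of that two-level lookup, using plain dictionaries in place of the trees (the class and member names are mine, and the domain objects are just plain object here):

using System.Collections.Generic;

class TwoLevelIndex {
    // "tree 1": key -> domain object
    private readonly Dictionary<string, object> m_Objects = new Dictionary<string, object>();
    // "tree 2": search fragment -> keys that contain it
    private readonly Dictionary<string, List<string>> m_Fragments = new Dictionary<string, List<string>>();

    public void Add(string key, object domainObject, IEnumerable<string> fragments) {
        m_Objects[key] = domainObject;
        foreach (string fragment in fragments) {
            List<string> keys;
            if (!m_Fragments.TryGetValue(fragment, out keys))
                m_Fragments[fragment] = keys = new List<string>();
            keys.Add(key);
        }
    }

    public IEnumerable<object> Find(string fragment) {
        List<string> keys;
        if (!m_Fragments.TryGetValue(fragment, out keys))
            yield break;
        foreach (string key in keys)
            yield return m_Objects[key];
    }
}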

jim
+1  A: 

Here's a WAG for you. I am in NO WAY Knuthian in my algorithm savvy.

Okay, so the naive trie encodes string keys by starting at the root of the tree and moving down branches that match each letter in the key, starting at the first letter of the key. So the key "foo" would be mapped to (root)->f->fo->foo, and the value would be stored in the location pointed to by the 'foo' node.

You are searching for ANY substring within the key, not just substrings that start at the beginning of the key.

So, what you need to do is associate a node with ANY key that contains that particular substring. In the foo example I gave before, you would NOT have found a reference to foo's value under the nodes 'f' and 'fo'. In a TST that supports the type of searches you're looking to do, you'd not only find the foo object under all three nodes ('f', 'fo', and 'foo'), you'd also find it under 'o' and 'oo' as well.

There are a couple of obvious consequences to expanding the search tree to support this type of indexing. First, you've just exploded the size of the tree. Staggeringly. If you can store it and use it in an efficient manner, your searches will take O(1) time. If your keys remain static, and you can find a way to partition the index so you don't take a huge IO penalty in using it, this might amortize to be worthwhile.
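
To put a rough number on that explosion for the question's data (a back-of-the-envelope estimate only): a key of length 30 has 30 · 31 / 2 = 465 substrings, so 200,000 such keys produce on the order of 465 × 200,000 ≈ 93 million index entries before any sharing or deduplication.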

Second, you are going to find that searches for small strings will result in massive numbers of hits, which may make your search useless unless you, say, put a minimum length on search terms.

On the bright side, you might also find that you can compress the tree via tokenization (like zip compression does) or by compressing nodes that don't branch down (i.e., if you have 'w'->'o'->'o'-> and the first 'o' doesn't branch, you can safely collapse it to 'w'->'oo'). Maybe even a wicked-ass hash could make things easier...

Anyhow, WAG as I said.

Will
A: 

Choose a minimum search string size (e.g. four characters). Go through your list of string entries and build up a dictionary of every four-character substring, mapping to a list of the entries that the substring appears in. When you do a search, look up the first four characters of the search string to get an initial set, then narrow that initial set down to only those entries that match the full search string.

The worst case of this is O(n), but you'll only get that if your string entries are almost all identical. The lookup dictionary is likely to be quite large, so it's probably a good idea to store it on disk or use a relational database :-)
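
A minimal sketch of that scheme, assuming everything fits in memory (the class name and the fixed Q = 4 are my choices for illustration):

using System.Collections.Generic;

class SubstringIndex {
    private const int Q = 4;  // minimum search string size
    private readonly List<string> m_Entries;
    // every Q-character substring -> indices of the entries it appears in
    private readonly Dictionary<string, List<int>> m_Map = new Dictionary<string, List<int>>();

    public SubstringIndex(List<string> entries) {
        m_Entries = entries;
        for (int i = 0; i < entries.Count; ++i) {
            string entry = entries[i];
            for (int j = 0; j + Q <= entry.Length; ++j) {
                string gram = entry.Substring(j, Q);
                List<int> hits;
                if (!m_Map.TryGetValue(gram, out hits))
                    m_Map[gram] = hits = new List<int>();
                if (hits.Count == 0 || hits[hits.Count - 1] != i)
                    hits.Add(i);  // skip consecutive duplicates for the same entry
            }
        }
    }

    public IEnumerable<string> Find(string query) {
        if (query.Length < Q)
            yield break;  // enforce the minimum search string size
        List<int> candidates;
        if (!m_Map.TryGetValue(query.Substring(0, Q), out candidates))
            yield break;
        // narrow the initial set down to the entries containing the full query
        foreach (int i in candidates)
            if (m_Entries[i].Contains(query))
                yield return m_Entries[i];
    }
}

The narrowing step is the final Contains check; everything else is just a plain dictionary of 4-grams.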

Simon Howard
A: 

/EDIT: A friend of mine just pointed out a stupid assumption in my construction of the q-gram table. The construction can be made much simpler – and consequently, much faster. I've edited the source code and explanation to reflect this. I think it might be the final solution.

Inspired by Rafał Dowgird's comment on my previous answer, I've updated my code. I think this merits its own answer, however, since it's also quite long. Instead of padding the existing strings, this code builds the index over the original array of strings. Instead of storing a single position, the suffix array stores a pair: the index of the target string and the position of the suffix within that string. For the lookup result, only the first number is needed; the second number is necessary for the construction of the q-gram table.

The new version of the algorithm builds the q-gram table by walking over the suffix array instead of the original strings. This saves the binary search of the suffix array. Consequently, the runtime of the construction drops from O(n * log n) down to O(n) (where n is the size of the suffix array).

Notice that, as in my first solution, the use of Substring results in a lot of unnecessary copies. The obvious solution is to write an extension method that creates a lightweight wrapper instead of copying the string. The comparison then has to be slightly adapted. This is left as an exercise for the reader. ;-)
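
One possible shape for that exercise (my sketch, not part of the original answer): compare the two suffixes in place, character by character, using offsets into the original strings instead of Substring copies.

static class SuffixCompare {
    // Ordinal comparison of a[offsetA..] against b[offsetB..] without allocating
    // any substrings; a suffix that is a prefix of the other sorts first.
    public static int Compare(string a, int offsetA, string b, int offsetB) {
        int i = offsetA, j = offsetB;
        while (i < a.Length && j < b.Length) {
            int diff = a[i] - b[j];
            if (diff != 0) return diff;
            ++i; ++j;
        }
        return (a.Length - i) - (b.Length - j);
    }
}

The Array.Sort call in the code below could then use (x, y) => SuffixCompare.Compare(m_Data[x.Key], x.Value, m_Data[y.Key], y.Value) instead of building substrings.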

using System;
using System.Collections.Generic;
using System.Linq;
using Position = System.Collections.Generic.KeyValuePair<int, int>;

class QGramIndex {
    private readonly int m_Q;
    private readonly IList<string> m_Data;
    private Position[] m_SA;
    private Dictionary<string, int> m_Dir;

    public QGramIndex(IList<string> strings, int q) {
        m_Q = q;
        m_Data = strings;
        MakeSuffixArray();
        MakeIndex();
    }

    public int this[string s] { get { return FindInIndex(s); } }

    private int FindInIndex(string s) {
        int idx;
        if (!m_Dir.TryGetValue(s, out idx))
            return -1;
        return m_SA[idx].Key;
    }

    private void MakeSuffixArray() {
        int size = m_Data.Sum(str => str.Length < m_Q ? 0 : str.Length - m_Q + 1);
        m_SA = new Position[size];
        int pos = 0;
        for (int i = 0; i < m_Data.Count; ++i)
            for (int j = 0; j <= m_Data[i].Length - m_Q; ++j)
                m_SA[pos++] = new Position(i, j);

        Array.Sort(
            m_SA,
            (x, y) => string.CompareOrdinal(
                m_Data[x.Key].Substring(x.Value),
                m_Data[y.Key].Substring(y.Value)
            )
        );
    }

    private void MakeIndex() {
        m_Dir = new Dictionary<string, int>(m_SA.Length);

        // Every q-gram is a prefix in the suffix table.
        for (int i = 0; i < m_SA.Length; ++i) {
            var pos = m_SA[i];
            m_Dir[m_Data[pos.Key].Substring(pos.Value, m_Q)] = i;
        }
    }
}

Usage is the same as in the other example, minus the required maxlen argument for the constructor.

Konrad Rudolph