Suffix Array and q-gram index
If your strings have a strict upper bound on the size you might consider the use of a suffix array: Simply pad all your strings to the same maximum length using a special character (e.g. the null char). Then concatenate all strings and build a suffix array index over them.
This gives you a lookup runtime of m * log n where m is the length of your query string and n is the overall length of your combined strings. If this still isn't good enough and your m has a fixed, small length, and your alphabet Σ is restricted in size (say, Σ < 128 different characters) you can additionally build a q-gram index. This will allow retrieval in constant time. However, the q-gram table requires Σm entries (= 8 MiB in the case of just 3 characters, and 1 GiB for 4 characters!).
Making the index smaller
It might be possible to reduce the size of the q-gram table (exponentially, in the best case) by adjusting the hash function. Instead of assigning a unique number to every possible q-gram you might employ a lossy hash function. The table then would have to store lists of possible suffix array indices instead of just one suffix array entry corresponding to an exact match. This would entail that lookup is no longer constant, though, because all entries in the list would have to be considered.
By the way, I'm not sure if you're familiar with how a q-gram index works since the Internet isn't helpful on this topic. I've mentioned this before in another topic. I've therefore included a description and an algorithm for the construction in my bachelor thesis.
Proof of concept
I've written a very small C# proof of concept (since you stated otherwise that you worked with C#). It works, however it is very slow for two reasons. First, the suffix array creation simply sorts the suffixes. This alone has runtime n2 log n. There are far superior methods. Worse, however, is the fact that I use SubString
to obtain the suffixes. Unfortunately, .NET creates copies of the whole suffix for this. To use this code in practice, make sure that you use in-place methods which do not copy any data around unnecessarily. The same is true for retrieving the q-grams from the string.
It would possibly even better to not construct the m_Data
string used in my example. Instead, you could save a reference to the original array and simulate all my SubString
accesses by working on this array.
Still, it's easy to see that this implementation has essentially expected constant time retrieval (if the dictionary is well-behaved)! This is quite an achievement that can't possibly be beaten by a search tree/trie!
class QGramIndex {
private readonly int m_Maxlen;
private readonly string m_Data;
private readonly int m_Q;
private int[] m_SA;
private Dictionary<string, int> m_Dir = new Dictionary<string, int>();
private struct StrCmp : IComparer<int> {
public readonly String Data;
public StrCmp(string data) { Data = data; }
public int Compare(int x, int y) {
return string.CompareOrdinal(Data.Substring(x), Data.Substring(y));
}
}
private readonly StrCmp cmp;
public QGramIndex(IList<string> strings, int maxlen, int q) {
m_Maxlen = maxlen;
m_Q = q;
var sb = new StringBuilder(strings.Count * maxlen);
foreach (string str in strings)
sb.AppendFormat(str.PadRight(maxlen, '\u0000'));
m_Data = sb.ToString();
cmp = new StrCmp(m_Data);
MakeSuffixArray();
MakeIndex();
}
public int this[string s] { get { return FindInIndex(s); } }
private void MakeSuffixArray() {
// Approx. runtime: n^3 * log n!!!
// But I claim the shortest ever implementation of a suffix array!
m_SA = Enumerable.Range(0, m_Data.Length).ToArray();
Array.Sort(m_SA, cmp);
}
private int FindInArray(int ith) {
return Array.BinarySearch(m_SA, ith, cmp);
}
private int FindInIndex(string s) {
int idx;
if (!m_Dir.TryGetValue(s, out idx))
return -1;
return m_SA[idx] / m_Maxlen;
}
private string QGram(int i) {
return i > m_Data.Length - m_Q ?
m_Data.Substring(i) :
m_Data.Substring(i, m_Q);
}
private void MakeIndex() {
for (int i = 0; i < m_Data.Length; ++i) {
int pos = FindInArray(i);
if (pos < 0) continue;
m_Dir[QGram(i)] = pos;
}
}
}
Example of usage:
static void Main(string[] args) {
var strings = new [] { "hello", "world", "this", "is", "a",
"funny", "test", "which", "i", "have",
"taken", "much", "too", "far", "already" };
var index = new QGramIndex(strings, 10, 3);
var tests = new [] { "xyz", "aki", "ake", "muc", "uch", "too", "fun", "est",
"hic", "ell", "llo", "his" };
foreach (var str in tests) {
int pos = index[str];
if (pos > -1)
Console.WriteLine("\"{0}\" found in \"{1}\".", str, strings[pos]);
else
Console.WriteLine("\"{0}\" not found.", str);
}
}