ansaurus

Question

Efficiently check string for one of several hundred possible suffixes

Answer 1

+2 A:

I think that building your own automata would be the most efficient way.. it's a sort of your second solution, according to which, starting from a finite set of suffixes, it generates an automaton fitted for that suffixes.

I think you can easily use flex to do it, taking care of reversing the input or handling in a special way the fact that you are looking just for suffixes (just for efficienty matters)..

By the way using a Rabin-Karp approach would be efficient too since your suffixes will be short. You can fit a hashset with all the suffixes needed and then

take a string
take the suffix
calculate the hash of the suffix
check if suffix is in the table

Jack 2010-04-04 17:14:15

Answer 2

+3 A:

I would reverse all of the suffix strings, build a prefix tree of them and then test the reverse of your IP string against that.

patros 2010-04-04 17:14:41

Answer 3

+11 A:

If the suffixes don't contain any expansions/rules (like a regex), you could build a Trie of the suffixes in reverse order, and then match the string based on that. For instance

suffixes:
  foo
  bar
  bao

reverse order suffix trie:
  o
   -a-b  (matches bao)
   -o-f  (matches foo)
  r-a-b  (matches bar)

These can then be used to match your string:

"mystringfoo" -> reverse -> "oofgnirtsym" -> trie match -> foo suffix

James Kolpack 2010-04-04 17:18:20

Answer 4

+4 A:

You mention that you're looking at second-level domain names only, so even without knowing the precise set of matching domains, you could extract the relevant portion of the input string.

Then simply use a hashtable. Dimension it in such a way that there are no collisions, so you don't need buckets; lookups will be exactly O(1). For small hash types (e.g. 32 bits), you'd want to check if the strings really match. For a 64-bit hash, the probability of another domain colliding with one of the hashes in your table is already so low (order 10^-17) that you can probably live with it.

Thomas 2010-04-04 17:20:59

Answer 5

A:

Just create a 26x26 array of set of domains. e.g. thisArray[0][0] will be the domains that end in 'aa', thisArray[0][1] is all the domains that end in 'ab' and so on...

Once you have that, just search your array for thisArray[2nd last char of hostname][last char of hostname] to get the possible domains. If there's more than one at that stage, just brute force the rest.

Peter Alexander 2010-04-04 17:29:56

Answer 6

A:

I think that the solution should be very different depending on the type of input strings. If the strings are some kind of string class that can be iterated from the end (such as stl strings) it is a lot easier than if they are NULL-terminated C-strings.

String Class

Iterate the string backwards (don't make a reverse copy - use some kind of backward iterator). Build a Trie where each node consists of two 64-bit words, one pattern and one bitmask. Then check 8 characters at a time in each level. The mask is used if you want to match less than 8 characters - e.g. deny "*.org" would give a mask with 32 bits set. The mask is also used as termination criteria.

C strings

Construct an NDFA that matches the strings on a single-pass over them. That way you don't have to first iterate to the end but can instead use it in one pass. An NDFA can be converted to a DFA, which will probably make the implementation more efficient. Both construction of the NDFA and conversion to DFA will probably be so complex that you will have to write tools for it.

Anders Abel 2010-04-04 17:50:45

Answer 7

A:

After some research and deliberation I've decided to go with trie/finite state machine approach.

The string is parsed starting from the last character going backwards using a TRIE as long as the portion of suffix that was parsed so far can correspond to multiple suffixes. At some point we either hit the first character of one of the possible suffixes which means that we have a match, hit a dead end, which means there are no more possible matches or get into situation where there is only one suffix candidate. In this case we just do compare remainder of the suffix.

Since trie lookups are constant time, worst case complexity is o(maximum suffix length). The funtion turned out to be pretty fast. On 2.8Ghz Core i5 it can check 33,000,000 strings per second for 2K possible suffixes. 2K suffixes totaling 18 kilobytes, expanded to 320kb trie/state machine table. I guess that I could have stored it more efficiently but this solution seems to work good enough for the time being.

Since suffix list was so large, I didn't want to code it all by hand so I ended up writing C# application that generated C code for the suffix checking function:

    public static uint GetFourBytes(string s, int index)
    {
        byte[] bytes = new byte[4] { 0, 0, 0, 0};
        int len = Math.Min(s.Length - index, 4);
        Encoding.ASCII.GetBytes(s, index, len, bytes, 0);
        return BitConverter.ToUInt32(bytes, 0);
    }

    public static string ReverseString(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    static StringBuilder trieArray = new StringBuilder();
    static int trieArraySize = 0;

    static void Main(string[] args)
    {
        // read all non-empty lines from input file
        var suffixes = File
            .ReadAllLines(@"suffixes.txt")
            .Where(l => !string.IsNullOrEmpty(l));

        var reversedSuffixes = suffixes
            .Select(s => ReverseString(s));

        int start = CreateTrieNode(reversedSuffixes, "");

        string outFName = @"checkStringSuffix.debug.h";
        if (args.Length != 0 && args[0] == "--release")
        {
            outFName = @"checkStringSuffix.h";
        }

        using (StreamWriter wrt = new StreamWriter(outFName))
        {
            wrt.WriteLine(
                "#pragma once\n\n" +
                "#define TRIE_NONE -1000000\n"+
                "#define TRIE_DONE -2000000\n\n"
                );

            wrt.WriteLine("const int trieArray[] = {{{0}\n}};", trieArray);

            wrt.WriteLine(
                "inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) {\n"+
                "   int len = trie[0];\n"+
                "   if (curr - str < len) return false;\n"+
                "   const char* cmp = (const char*)(trie + 1);\n"+
                "   while (len-- > 0) {\n"+
                "       if (*--curr != *cmp++) return false;\n"+
                "   }\n"+
                "   return true;\n"+
                "}\n\n"+
                "bool checkStringSuffix(const char* str, int len) {\n" +
                "   if (len < " + suffixes.Select(s => s.Length).Min().ToString() + ") return false;\n" +
                "   const char* curr = (str + len - 1);\n"+
                "   int currTrie = " + start.ToString() + ";\n"+
                "   while (curr >= str) {\n" +
                "       assert(*curr >= 0x20 && *curr <= 0x7f);\n" +
                "       currTrie = trieArray[currTrie + *curr - 0x20];\n" +
                "       if (currTrie < 0) {\n" +
                "           if (currTrie == TRIE_NONE) return false;\n" +
                "           if (currTrie == TRIE_DONE) return true;\n" +
                "           return checkSingleSuffix(str, curr, trieArray - currTrie - 1);\n" +
                "       }\n"+
                "       --curr;\n"+
                "   }\n" +
                "   return false;\n"+
                "}\n"
                );
        }        
    }

    private static int CreateTrieNode(IEnumerable<string> suffixes, string prefix)
    {
        int retVal = trieArraySize;

        if (suffixes.Count() == 1)
        {
            string theSuffix = suffixes.Single();
            trieArray.AppendFormat("\n\t/* {1} - {2} */ {0}, ", theSuffix.Length, trieArraySize, prefix);
            ++trieArraySize;
            for (int i = 0; i < theSuffix.Length; i += 4)
            {
                trieArray.AppendFormat("0x{0:X}, ", GetFourBytes(theSuffix, i));
                ++trieArraySize;
            }

            retVal = -(retVal + 1);
        }
        else
        {
            var groupByFirstChar =
                from s in suffixes
                let first = s[0]
                let remainder = s.Substring(1)
                group remainder by first;

            string[] trieIndexes = new string[0x60];
            for (int i = 0; i < trieIndexes.Length; ++i)
            {
                trieIndexes[i] = "TRIE_NONE";
            }

            foreach (var g in groupByFirstChar)
            {
                if (g.Any(s => s == string.Empty))
                {
                    trieIndexes[g.Key - 0x20] = "TRIE_DONE";
                    continue;
                }
                trieIndexes[g.Key - 0x20] = CreateTrieNode(g, g.Key + prefix).ToString();
            }
            trieArray.AppendFormat("\n\t/* {1} - {2} */ {0},", string.Join(", ", trieIndexes), trieArraySize, prefix);
            retVal = trieArraySize;
            trieArraySize += 0x60;
        }

        return retVal;
    }

So it generates code like this:

    inline bool checkSingleSuffix(const char* str, const char* curr, const int* trie) {
       int len = trie[0];
       if (curr - str < len) return false;
       const char* cmp = (const char*)(trie + 1);
       while (len-- > 0) {
           if (*--curr != *cmp++) return false;
       }
       return true;
    }

    bool checkStringSuffix(const char* str, int len) {
       if (len < 5) return false;
       const char* curr = (str + len - 1);
       int currTrie = 81921;
       while (curr >= str) {
           assert(*curr >= 0x20 && *curr <= 0x7f);
           currTrie = trieArray[currTrie + *curr - 0x20];
           if (currTrie < 0) {
               if (currTrie == TRIE_NONE) return false;
               if (currTrie == TRIE_DONE) return true;
               return checkSingleSuffix(str, curr, trieArray - currTrie - 1);
           }
           --curr;
       }
       return false;
    }

Since for my particular set of data len in checkSingleSuffix was never more than 9, I tried to replace the comparison loop with switch (len) and hardcoded comparison routines that compared up to 8 bytes of data at a time but it didn't affect overall performance at all either way.

Thanks for everyone who contributed their ideas!

Ghostrider 2010-04-07 01:20:41

ansaurus

tags:

views:

answers:

Efficiently check string for one of several hundred possible suffixes

String Class

C strings

related questions