This is a design question.

Background: We get a web request into our system from many different websites (for a widget that we give out), from which we grab the referrer string (if it exists). We use the referrer to decide on some things within the application. The problem arises in that I need to look at a list of "sites" (urls, partial urls, urls containing wildcards) in order to determine what to do. This list could be on the order of many thousands of sites. I need to be able to ask something like a "Site Service" (or whatever) if the referrer is a match with anything in the site list. I need to do this fast, say 5-10ms, give or take a few ms, and get a positive or negative result back.

Here is a basic example:

Request - Referrer = http://www.stackoverflow.com/users/120262?tab=accounts

Site List Could Contain urls like:

  • users.stackoverflow.com -- (not a match)
  • www.stackoverflow.com/users -- (match)
  • www.stackoverflow.com/users/120262 -- (match)
  • www.stackoverflow.com/users/* -- (match)
  • */users/* -- (match)
  • www.stackoverflow.com/users/239289 -- (not a match)
  • *.stackoverflow.com/questions/ask -- (not a match)
  • */questions/* -- (not a match)
  • www.stackoverflow.com -- (match)
  • www.msdn.com -- (not a match)
  • *.msdn.com -- (not a match)
  • developer.*.com -- (not a match)

You get the idea...

The issue I am dealing with is how to handle this in a performant and scalable way.

Performant in that I need to make a decision fast so that I can move on to the real processing that needs to happen.

Scalable in that a list of thousands of "sites" is set up for each affiliate that we have, and each affiliate may have many site lists, making for thousands of site lists containing thousands of sites.

I'm willing to consider pretty much anything here as I am just in the initial (re)design phase of this. Any and all thoughts are welcome including solution suggestions, general patterns to look into, existing tools even.

Thanks.

A: 

This is a partial answer, assuming that the patterns you are trying to match against are each either a constant string with no wildcards in it, or a sequence of constant strings separated by "*" wildcards, where each wildcard can match any string.

This problem has been studied quite a bit in the context of implementing network-based and host-based intrusion detection systems, where you scan network traffic for a large set of patterns, each of which might be a sign of an intruder sending attack traffic at you.

In the special case where there are no wildcards at all in the patterns, and your set of patterns changes infrequently (so you can afford to spend some time precomputing data structures when it does), a well-known approach is the Aho-Corasick algorithm:

http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
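
To make that concrete, here is a compact Aho-Corasick sketch in PHP (my own illustration to go with this answer, not production code - the function names are made up). Given a set of constant patterns, it reports every pattern that occurs anywhere in a referrer string, in a single pass over that string:

function ac_build(array $patterns) {
    // Node 0 is the root.  Each node has goto edges ('next'), a failure
    // link ('fail'), and the list of patterns that end here ('out').
    $nodes = array(array('next' => array(), 'fail' => 0, 'out' => array()));

    // Build the trie of patterns.
    foreach ($patterns as $p) {
        $cur = 0;
        foreach (str_split($p) as $ch) {
            if (!isset($nodes[$cur]['next'][$ch])) {
                $nodes[] = array('next' => array(), 'fail' => 0, 'out' => array());
                $nodes[$cur]['next'][$ch] = count($nodes) - 1;
            }
            $cur = $nodes[$cur]['next'][$ch];
        }
        $nodes[$cur]['out'][] = $p;
    }

    // Breadth-first pass to fill in the failure links.
    $queue = array_values($nodes[0]['next']);
    while ($queue) {
        $u = array_shift($queue);
        foreach ($nodes[$u]['next'] as $ch => $v) {
            $f = $nodes[$u]['fail'];
            while ($f !== 0 && !isset($nodes[$f]['next'][$ch])) {
                $f = $nodes[$f]['fail'];
            }
            $nodes[$v]['fail'] = isset($nodes[$f]['next'][$ch]) ? $nodes[$f]['next'][$ch] : 0;
            $nodes[$v]['out'] = array_merge($nodes[$v]['out'],
                $nodes[$nodes[$v]['fail']]['out']);
            $queue[] = $v;
        }
    }
    return $nodes;
}

function ac_search(array $nodes, $text) {
    $hits = array();
    $cur = 0;
    foreach (str_split($text) as $ch) {
        while ($cur !== 0 && !isset($nodes[$cur]['next'][$ch])) {
            $cur = $nodes[$cur]['fail'];
        }
        if (isset($nodes[$cur]['next'][$ch])) {
            $cur = $nodes[$cur]['next'][$ch];
        }
        foreach ($nodes[$cur]['out'] as $p) {
            $hits[$p] = true;
        }
    }
    return array_keys($hits);   // which constant patterns appeared in $text
}

The point is that matching a referrer against thousands of constant patterns becomes one linear scan of the referrer, rather than thousands of separate substring searches.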

If you then want to generalize to allow wildcards, the following idea might not have good worst-case performance, but would likely perform well in practice. Break up each pattern that has wildcards in it into its "constant string" parts, e.g. break up the pattern "developer.*.com" into "developer." and ".com". Put those two strings into the list of ones you are searching for separately. Only if an incoming URL contains both "developer." and ".com" would you then do some more post-processing to make sure it contains them in the desired order (as opposed to the opposite order, the way "a.com.developer.foo" does, which should therefore not match the pattern "developer.*.com").
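
As a sketch of that post-processing step (again, just an illustration - the function name is made up, and a real system would get the per-piece hits from a multi-pattern matcher like the one above rather than calling strpos() per piece):

function wildcard_pattern_matches($pattern, $url) {
    // Split the pattern on "*" to get its constant pieces, then require
    // each piece to occur in the URL, in order, left to right.
    $pieces = array_filter(explode('*', $pattern), 'strlen');
    $offset = 0;
    foreach ($pieces as $piece) {
        $pos = strpos($url, $piece, $offset);
        if ($pos === false) {
            return false;   // piece missing, or present only out of order
        }
        $offset = $pos + strlen($piece);
    }
    return true;
}

// "developer.*.com" has the pieces "developer." and ".com":
// wildcard_pattern_matches('developer.*.com', 'developer.mozilla.com')  => true
// wildcard_pattern_matches('developer.*.com', 'a.com.developer.foo')    => false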

For large sets of patterns, Aho-Corasick can require lots of memory to store the state-machine that it represents. There have been other similar methods designed later to improve on it. For example, Google for the paper title "Advanced Algorithms for Fast and Scalable Deep Packet Inspection" by Kumar, Turner, and Williams.

I am aware of other methods of solving this, too, which are patented by Cisco Systems. If there is any chance your company would license these methods, or already has some kind of bulk cross-licensing agreement with Cisco, I'd be happy to tell you more about them.

andy_fingerhut
+3  A: 

You have a list of URLs and you want reliable 5-10 ms response time, so running through a list of strings and doing a regex (or even string comparison) against each may not scale well as the list gets large (but is worth testing anyway as a baseline). My first thought for doing better is: can't we somehow pre-process this list into a form that allows for fast lookups? I think we can.

Assumptions I made to start:

  1. A URL is a list of tokens. To get them, we drop the protocol, drop the query string, and find the first "/" (and if none, add it to the end). The stuff before the "/" is the hostname, and the stuff after "/" is the path.
  2. We get the hostname tokens by splitting on ".".
  3. We get the path tokens by splitting on "/".

So then this URL, for example:

"www.stackoverflow.com/users/239289"

becomes the tokens:

"www", "stackoverflow", "com", "/", "users", "239289"

Note that the only "/" we allow is the one separating the hostname and path.

Here's my function to tokenize a URL - I did this in PHP, my language of choice for fast prototyping (and a lot more). It's not perfect code but it gets the job done reasonably well:

function tokenize_url($url) {
    $pos = strpos($url, '/');
    if ($pos === 0) {
        // It's a path-only entry.
        $hostname = '*';
        $path = substr($url, 1);
    } else if ($pos !== false) {
        // It's a normal URL with hostname and path.
        $hostname = substr($url, 0, $pos);
        $path = substr($url, $pos + 1);
        if ($path === false) {
            $path = '';
        }
    } else {
        // No slash found, assume it's a hostname only.
        $hostname = $url;
        $path = '';
    }

    if ($hostname !== '') {
        $hostname_tokens = explode('.', $hostname);
    } else {
        $hostname_tokens = array();
    }

    if ($path !== '') {
        $path_tokens = explode('/', $path);
    } else {
        $path_tokens = array();
    }

    return array_merge($hostname_tokens, array('/'), $path_tokens);
}

So I'm thinking I'll pre-process your list of URLs by going through it, tokenizing each URL, and storing the tokens in a directed graph - essentially a trie of tokens, built out of nested associative arrays. This way, we only have to traverse the graph once for exact matches (a bit more work to find wildcard matches), and it's O(1) at each step. We mark the end of each pattern to match against by hanging a special marker "%!%!%" off that node.

Here's the function to build the graph - hopefully the code is fairly self-explanatory:

function compile_site_list($site_list) {
    $root = array();

    foreach ($site_list as $url) {
        $tokens = tokenize_url($url);
        $node = &$root;

        for ($i=0; $i<count($tokens); $i++) {
            // The "%" forces PHP to evaluate this as a string, no matter what.
            // Sadly, casting to a string doesn't do it!
            $token = $tokens[$i] . '%';

            // If this is our first time seeing this string here, make a
            // blank node.
            if (!(isset($node[$token]))) {
                $node[$token] = array();
            }

            if ($i < (count($tokens) - 1)) {
                // If we're not at the end yet, keep traversing down.
                $node = &$node[$token];
            } else {
                // If we're at the end, mark it with our special marker.
                $node[$token]['%!%!%'] = 1;
            }
        }
    }

    return $root;
}

So once you have your list of URLs to match against, you just have to call compile_site_list() once, and keep this around - in memory, or a serialized array, or something similar.
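
One simple way to keep the compiled structure around between requests (just a sketch - the cache path is made up, and APC/memcached or similar would work just as well) is to serialize it to disk and only rebuild it when the site list changes:

// Pay for compile_site_list() once; each request then only pays for
// unserialize() plus the lookup itself.
$cache_file = '/tmp/compiled_site_list.ser';   // hypothetical location

if (is_readable($cache_file)) {
    $compiled_site_list = unserialize(file_get_contents($cache_file));
} else {
    $compiled_site_list = compile_site_list($site_list);
    file_put_contents($cache_file, serialize($compiled_site_list));
}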

Now it's time to match a URL. First, when we get one, we need to scrub it as I mentioned above:

function scrub_url($url) {
    // Get rid of the protocol (if present).
    $pos = strpos($url, '://');
    if ($pos !== false) {
        $url = substr($url, $pos + 3);
    }

    // Get rid of any query string (if present).
    $pos = strpos($url, '?');
    if ($pos !== false) {
        $url = substr($url, 0, $pos);
    }

    return $url;
}

To search the data structure we made, we take the tokens for the URL we're matching against and look for them recursively in the graph. As soon as we find "%!%!%" in the graph - we've got a match and it's game over.

However, if we get to the end of our tokens and haven't matched, we go back up and look for wildcards. If we find the wildcard, we let it consume as many tokens as it wants (except for "/") and see if that results in a match.

If none of that finds a match - it's not in the list.

Here's the code:

function search_compiled_list($url, $compiled_site_list) {
    $url = scrub_url($url);
    $tokens = tokenize_url($url);

    return do_search($tokens, $compiled_site_list);
}

function do_search($tokens, $compiled_site_list) {
    // Base cases for recursion:
    if (isset($compiled_site_list['%!%!%'])) {
        // If we're at a node that has our marker hanging off it - we found it!
        return true;
    } else if (count($tokens) === 0) {
        // Otherwise, if we got somewhere and have no tokens left, we didn't
        // find it.
        return false;
    }

    // The "%" on the end forces PHP to evaluate this as a string, no
    // matter what.
    $token = $tokens[0] . '%';

    if (isset($compiled_site_list[$token])) {
        // Found an exact match!
        $result = do_search(array_slice($tokens, 1),
            $compiled_site_list[$token]);
        if ($result === true) {
            return true;
        }
    }

    // Didn't find an exact match - let's check for wildcards.
    if ((isset($compiled_site_list['*%'])) && ($tokens[0] !== '/')) {
        // If we're matching the wildcard, let it consume as many tokens
        // as it wants - possibly all of the remaining ones.  The only
        // thing it can't consume is /.
        for ($i = 1; $i <= count($tokens); $i++) {
            if ($i > 1 && $tokens[$i - 1] === '/') {
                // Stop before the wildcard swallows the hostname/path
                // separator.
                break;
            }
            $result = do_search(array_slice($tokens, $i),
                $compiled_site_list['*%']);
            if ($result === true) {
                return true;
            }
        }
    }

    return false;
}

So to see the whole thing in action - if you have $site_list which is an array of your URLs, you'd do:

$url_to_check = "http://www.stackoverflow.com/users/120262?tab=accounts";
$compiled_site_list = compile_site_list($site_list);
$result = search_compiled_list($url_to_check, $compiled_site_list);
var_dump($result);

I tested this code on a bunch of URLs and it seemed to work, but I make no claim to have tested it exhaustively. I think my reasoning is sound but am certainly open to comments/criticism. (I've had a similar problem bouncing around my head for a bit and this was a fun exercise.)

The end result is that the efficiency of the algorithm is dictated not by the number of URLs to match against, but really the length of the URL (in tokens) that you're trying to match. I did some anecdotal timing on my laptop, where I compiled the site list ahead of time and stored it serialized - unserializing it and searching took around 2 ms total, at worst.
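
If you want to reproduce that kind of measurement yourself, here is a rough sketch (assuming $site_list and $url_to_check as above - your numbers will obviously vary with hardware and list size):

// Measure unserialize + lookup together, since that's the per-request
// cost once the compiled list is cached somewhere.
$serialized = serialize(compile_site_list($site_list));

$start = microtime(true);
$compiled = unserialize($serialized);
$found = search_compiled_list($url_to_check, $compiled);
$elapsed_ms = (microtime(true) - $start) * 1000;

printf("match: %s, took %.3f ms\n", $found ? 'yes' : 'no', $elapsed_ms);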

beamrider9
Note that you could eliminate some of the URL parsing code above with PHP's parse_url() function. I'm not going to add it out of a desire to keep the code straightforward and relatively language-agnostic.
beamrider9
A: 

I'm not positive I understood your problem correctly, but it sounds like a prefix tree (aka trie) might be quite handy here. Load all of your site list URLs into one and then walk down it quickly using the referrer URL. That should get you to the subset of "matches" very quickly. There are a number of trie variations worth looking into as well.
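
To sketch the idea in PHP (purely illustrative - the function names are made up, and this handles only exact prefix matches; wildcards would need extra handling along the lines of the answer above):

function trie_insert(array &$root, $pattern) {
    // Walk down the trie one character at a time, creating nodes as needed.
    $node = &$root;
    foreach (str_split($pattern) as $ch) {
        if (!isset($node[$ch])) {
            $node[$ch] = array();
        }
        $node = &$node[$ch];
    }
    $node['#end#'] = true;   // marks the end of a stored pattern
}

function trie_prefix_match(array $root, $url) {
    // Walk the referrer through the trie; succeed as soon as we pass the
    // end of any stored pattern (i.e. a pattern is a prefix of the URL).
    $node = $root;
    foreach (str_split($url) as $ch) {
        if (isset($node['#end#'])) {
            return true;
        }
        if (!isset($node[$ch])) {
            return false;
        }
        $node = $node[$ch];
    }
    return isset($node['#end#']);
}

// $trie = array();
// trie_insert($trie, 'www.stackoverflow.com/users');
// trie_prefix_match($trie, 'www.stackoverflow.com/users/120262');  // true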

Yevgeniy Brikman