You have a list of URLs and you want a reliable 5-10 ms response time. Running through a list of strings and doing a regex (or even a plain string comparison) against each one may not scale well as the list gets large, though it's worth testing anyway as a baseline. My first thought for doing better is: can't we somehow pre-process this list into a form that allows for fast lookups? I think we can.
Assumptions I made to start:
- A URL is a list of tokens. To get them, we drop the protocol, drop the query string, and find the first "/" (if there is none, we add one at the end). Everything before the "/" is the hostname, and everything after it is the path.
- We get the hostname tokens by splitting on ".".
- We get the path tokens by splitting on "/".
So then this URL, for example:
"www.stackoverflow.com/users/239289"
becomes the tokens:
"www", "stackoverflow", "com", "/", "users", "239289"
Note that the only "/" we allow is the one separating the hostname and path.
Here's my function to tokenize a URL - I did this in PHP, my language of choice for fast prototyping (and a lot more). It's not perfect code but it gets the job done reasonably well:
function tokenize_url($url) {
    $pos = strpos($url, '/');
    if ($pos === 0) {
        // It's a path-only entry.
        $hostname = '*';
        $path = substr($url, 1);
    } else if ($pos !== false) {
        // It's a normal URL with hostname and path.
        $hostname = substr($url, 0, $pos);
        $path = substr($url, $pos + 1);
        if ($path === false) {
            $path = '';
        }
    } else {
        // No slash found, assume it's a hostname only.
        $hostname = $url;
        $path = '';
    }
    if ($hostname !== '') {
        $hostname_tokens = explode('.', $hostname);
    } else {
        $hostname_tokens = array();
    }
    if ($path !== '') {
        $path_tokens = explode('/', $path);
    } else {
        $path_tokens = array();
    }
    return array_merge($hostname_tokens, array('/'), $path_tokens);
}
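Just to sanity-check it, running the example URL from above through it gives back exactly the tokens listed earlier:

print_r(tokenize_url('www.stackoverflow.com/users/239289'));
// The tokens come back as: www, stackoverflow, com, /, users, 239289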
So I'm thinking I'll pre-process your list of URLs by tokenizing each one and storing the tokens in a directed graph (basically nested associative arrays). This way, we only have to traverse the graph once for an exact match (a bit more to find wildcard matches), and each step is an O(1) array lookup. We mark the end of each pattern by hanging a special marker "%!%!%" off its final node.
Here's the function to build the graph - hopefully the code is fairly self-explanatory:
function compile_site_list($site_list) {
    $root = array();
    foreach ($site_list as $url) {
        $tokens = tokenize_url($url);
        $node = &$root;
        for ($i = 0; $i < count($tokens); $i++) {
            // The "%" forces PHP to evaluate this as a string, no matter what.
            // Sadly, casting to a string doesn't do it!
            $token = $tokens[$i] . '%';
            // If this is our first time seeing this string here, make a
            // blank node.
            if (!isset($node[$token])) {
                $node[$token] = array();
            }
            if ($i < (count($tokens) - 1)) {
                // If we're not at the end yet, keep traversing down.
                $node = &$node[$token];
            } else {
                // If we're at the end, mark it with our special marker.
                $node[$token]['%!%!%'] = 1;
            }
        }
    }
    return $root;
}
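To give you an idea of what the compiled structure looks like, a tiny one-entry list compiles down to nested arrays roughly like this (the trailing "%" on each key is the string-forcing trick from the comments above):

print_r(compile_site_list(array('stackoverflow.com/users')));
// Roughly: [stackoverflow%] => [com%] => [/%] => [users%] => [%!%!%] => 1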
So once you have your list of URLs to match against, you just have to call compile_site_list() once and keep the result around - in memory, in a serialized array on disk, or something similar.
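For example, the serialized-array route could look like this (the file name is just a placeholder):

// Compile once and stash the result on disk...
file_put_contents('/tmp/compiled_site_list.ser', serialize(compile_site_list($site_list)));
// ...then later, load it back whenever you need to search.
$compiled_site_list = unserialize(file_get_contents('/tmp/compiled_site_list.ser'));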
Now it's time to match a URL. First, when we get one, we need to scrub it as I mentioned above:
function scrub_url($url) {
    // Get rid of the protocol (if present).
    $pos = strpos($url, '://');
    if ($pos !== false) {
        $url = substr($url, $pos + 3);
    }
    // Get rid of any query string (if present).
    $pos = strpos($url, '?');
    if ($pos !== false) {
        $url = substr($url, 0, $pos);
    }
    return $url;
}
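For example:

echo scrub_url('http://www.stackoverflow.com/users/120262?tab=accounts');
// Prints: www.stackoverflow.com/users/120262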
To search the data structure we made, we take the tokens for the URL we're matching against and look for them recursively in the graph. As soon as we find "%!%!%" in the graph - we've got a match and it's game over.
However, if the exact-match path fails at some level, we back up and look for a wildcard there. If we find one, we let it consume as many tokens as it wants (except for "/") and see if that leads to a match.
If none of that finds a match - it's not in the list.
Here's the code:
function search_compiled_list($url, $compiled_site_list) {
    $url = scrub_url($url);
    $tokens = tokenize_url($url);
    return do_search($tokens, $compiled_site_list);
}

function do_search($tokens, $compiled_site_list) {
    // Base cases for the recursion:
    if (isset($compiled_site_list['%!%!%'])) {
        // If we're at a node that has our marker hanging off it - we found it!
        return true;
    } else if (count($tokens) === 0) {
        // Otherwise, if we got somewhere and have no tokens left, we didn't
        // find it.
        return false;
    }
    // The "%" on the end forces PHP to evaluate this as a string, no
    // matter what.
    $token = $tokens[0] . '%';
    if (isset($compiled_site_list[$token])) {
        // Found an exact match for this token - keep going down this branch.
        $result = do_search(array_slice($tokens, 1),
                            $compiled_site_list[$token]);
        if ($result === true) {
            return true;
        }
    }
    // Didn't find an exact match - let's check for wildcards.
    if (isset($compiled_site_list['*%']) && ($tokens[0] !== '/')) {
        // If we're matching the wildcard, let it consume as many tokens
        // as it wants. The only thing it can't consume is "/".
        for ($i = 1; $i < count($tokens); $i++) {
            if ($tokens[$i - 1] === '/') {
                // The wildcard would have to swallow the "/" separator to
                // go any further - stop here.
                break;
            }
            $result = do_search(array_slice($tokens, $i),
                                $compiled_site_list['*%']);
            if ($result === true) {
                return true;
            }
        }
    }
    return false;
}
So to see the whole thing in action - if you have $site_list which is an array of your URLs, you'd do:
$url_to_check = "http://www.stackoverflow.com/users/120262?tab=accounts";
$compiled_site_list = compile_site_list($site_list);
$result = search_compiled_list($url_to_check, $compiled_site_list);
var_dump($result);
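For example, with a small made-up list, you'd see something like:

$site_list = array('stackoverflow.com', '*.example.org/admin');
$compiled_site_list = compile_site_list($site_list);

var_dump(search_compiled_list('http://stackoverflow.com/users/120262?tab=accounts', $compiled_site_list));
// bool(true) - prefix match on "stackoverflow.com"
var_dump(search_compiled_list('https://foo.bar.example.org/admin', $compiled_site_list));
// bool(true) - the wildcard consumes "foo" and "bar"
var_dump(search_compiled_list('http://www.stackoverflow.com/users/120262', $compiled_site_list));
// bool(false) - "www.stackoverflow.com" isn't in the list and doesn't match a wildcard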
I tested this code on a bunch of URLs and it seemed to work, but I make no claim to have tested it exhaustively. I think my reasoning is sound but am certainly open to comments/criticism. (I've had a similar problem bouncing around my head for a bit and this was a fun exercise.)
The end result is that the efficiency of the algorithm is dictated not by the number of URLs to match against, but by the length (in tokens) of the URL you're trying to match. I did some anecdotal timing on my laptop, where I compiled the site list ahead of time and stored it serialized - unserializing it and searching took around 2 ms total, at worst.
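If you want to time it on your own list, something quick and dirty like this (reusing the placeholder file from earlier) will do:

$start = microtime(true);
$compiled_site_list = unserialize(file_get_contents('/tmp/compiled_site_list.ser'));
$result = search_compiled_list($url_to_check, $compiled_site_list);
printf("Match: %s, took %.3f ms\n", $result ? 'yes' : 'no', (microtime(true) - $start) * 1000);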