views:

1006

answers:

2

I need to match input strings (URLs) against a large set (anywhere from 1k-250k) of string rules with simple wildcard support.

Requirements for wildcard support are as follows:

Wildcard (*) can only substitute a "part" of a URL. That is fragments of a domain, path, and parameters. For example, "*.part.part/*/part?part=part&part=*". The only exception to this rule is in the path area where "/*" should match anything after the slash.

Examples:

  • *.site.com/* -- should match sub.site.com/home.html, sub2.site.com/path/home.html
  • sub.site.*/path/* -- should match sub.site.com/path/home.html, sub.site.net/path/home.html, but not sub.site.com/home.html

Additional requirements:

  • Fast lookup (I realize "fast" is a relative term. Given the max 250k rules, still fall within < 1.5s if possible.)
  • Work within the scope of a modern desktop (e.g. not a server implementation)
  • Ability to return 0:n matches given a input string
  • Matches will have rule data attached to them

What is the best system/algorithm for such as task? I will be developing the solution in C++ with the rules themselves stored in a SQLite database.

+1  A: 

If I'm not mistaken, you can take string rule and break it up into domain, path, and query pieces, just like it's a URL. Then you can apply a standard wildcard matching algorithm with each of those pieces against the corresponding pieces from the URLs you want to test against. If all of the pieces match, the rule is a match.

Example

Rule: *.site.com/*
    domain => *.site.com
    path   => /*
    query  => [empty]

URL: sub.site.com/path/home.html
    domain => sub.site.com
    path   => /path/home.html
    query  => [empty]

Matching process:
    domain => *.site.com matches sub.site.com?     YES
    path   => /*         matches /path/home.html?  YES
    query  => [empty]    matches [empty]           YES

Result: MATCH

As you are storing the rules in a database I would store them already broken into those three pieces. And if you want uber-speed you could convert the *'s to %'s and then use the database's native LIKE operation to do the matching for you. Then you'd just have a query like

SELECT *
FROM   ruleTable
WHERE  @urlDomain LIKE ruleDomain
   AND @urlPath   LIKE rulePath
   AND @urlQuery  LIKE ruleQuery

where @urlDomain, @urlPath, and @urlQuery are variables in a prepared statement. The query would return the rules that match a URL, or an empty result set if nothing matches.

John Kugelman
I had previously thought LIKE/GLOB could only match input patterns but not patterns in the rules (e.g. columns) themselves. This is working great. As a side note, I'm actually getting better performance by not breaking the rules up but rather to have them completed (and domains reversed as per Chris Harris' comment).
NuSkooler
+2  A: 

First of all, one of the worst performing searches you can do is with a wildcard at both ends of the string ".domain.com/path" -- and I think you're going to hit this case a lot. So my first recommendation is to reverse the order of the domains as they're stored in your DB: com.domain.example/path1/path2/page.html. That will allow you to keep things much more tidy and only use wildcards in "one direction" on the string, which will provide MUCH faster lookups.

I think John mentions some good points about how to do this all within your DB. If that doesn't work I would use a regex library in C++ against the list. I bet you'll get the best performance and most general regex syntax that way.

Chris Harris