ansaurus

Question

Fast string matching algorithm with simple wildcards support

Answer 1

+1 A:

If I'm not mistaken, you can take string rule and break it up into domain, path, and query pieces, just like it's a URL. Then you can apply a standard wildcard matching algorithm with each of those pieces against the corresponding pieces from the URLs you want to test against. If all of the pieces match, the rule is a match.

Example

Rule: *.site.com/*
    domain => *.site.com
    path   => /*
    query  => [empty]

URL: sub.site.com/path/home.html
    domain => sub.site.com
    path   => /path/home.html
    query  => [empty]

Matching process:
    domain => *.site.com matches sub.site.com?     YES
    path   => /*         matches /path/home.html?  YES
    query  => [empty]    matches [empty]           YES

Result: MATCH

As you are storing the rules in a database I would store them already broken into those three pieces. And if you want uber-speed you could convert the *'s to %'s and then use the database's native LIKE operation to do the matching for you. Then you'd just have a query like

SELECT *
FROM   ruleTable
WHERE  @urlDomain LIKE ruleDomain
   AND @urlPath   LIKE rulePath
   AND @urlQuery  LIKE ruleQuery

where @urlDomain, @urlPath, and @urlQuery are variables in a prepared statement. The query would return the rules that match a URL, or an empty result set if nothing matches.

John Kugelman 2009-07-02 04:39:25

I had previously thought LIKE/GLOB could only match input patterns but not patterns in the rules (e.g. columns) themselves. This is working great. As a side note, I'm actually getting better performance by not breaking the rules up but rather to have them completed (and domains reversed as per Chris Harris' comment).

NuSkooler 2009-07-03 02:36:34

Answer 2

+2 A:

First of all, one of the worst performing searches you can do is with a wildcard at both ends of the string ".domain.com/path" -- and I think you're going to hit this case a lot. So my first recommendation is to reverse the order of the domains as they're stored in your DB: com.domain.example/path1/path2/page.html. That will allow you to keep things much more tidy and only use wildcards in "one direction" on the string, which will provide MUCH faster lookups.

I think John mentions some good points about how to do this all within your DB. If that doesn't work I would use a regex library in C++ against the list. I bet you'll get the best performance and most general regex syntax that way.

Chris Harris 2009-07-02 05:52:18

ansaurus

tags:

views:

answers:

Fast string matching algorithm with simple wildcards support

Example

related questions