views: 1442
answers: 14

I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.

Here are my requirements:

  • Given a domain name, i.e. www.example.com, determine if it "matches" one in a list of entries.
  • Entries may be absolute matches, i.e. www.example.com.
  • Entries may include wildcards, i.e. *.example.com.
  • Wildcard entries match from the most-defined level and up. For example, *.example.com would match www.example.com, example.com, and sub.www.example.com.
  • Wildcard entries are not embedded, i.e. sub.*.example.com will not be an entry.

Language/environment: C# (.Net Framework 3.5)

I've considered splitting the entries (and domain lookup) into arrays, reversing the order, then iterating through the arrays. While accurate, it feels slow.

I've considered Regex, but am concerned about accurately representing the list of entries as regular expressions.

My question: what's an efficient way of finding if a string, in the form of a domain name, matches any one in a list of strings, given the description listed above?

+5  A: 

I would use Regex; just make sure the expression is compiled once (instead of being recompiled on every lookup).
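As a rough sketch of that idea in Java (the question targets C#, where `RegexOptions.Compiled` plays the same role as compiling the `Pattern` once up front); the class and method names here are illustrative, not from any existing library:

```java
import java.util.regex.Pattern;

public class CompiledRuleMatcher {

    // Translate one rule into a regex. Per the question's rules,
    // "*.example.com" matches example.com itself and any chain of
    // subdomain labels in front of it.
    public static Pattern compileRule(String rule) {
        if (rule.startsWith("*.")) {
            String suffix = Pattern.quote(rule.substring(2));
            // zero or more dotted labels, then the literal suffix
            return Pattern.compile("([^.]+\\.)*" + suffix);
        }
        return Pattern.compile(Pattern.quote(rule));
    }

    // Convenience wrapper for demonstration only; in production you would
    // compile each rule once and cache the Pattern, not recompile per call.
    public static boolean matchesRule(String candidate, String rule) {
        return compileRule(rule).matcher(candidate).matches();
    }
}
```

The `([^.]+\.)*` prefix only matches whole labels, so `badexample.com` does not match `*.example.com`.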

eglasius
Care to elaborate?
jro
I think that regular expressions, even with precompilation, would be very much overkill and inefficient for matching of this simple nature. Then again, I think most naive approaches would also be very inefficient, so maybe this is the best easy solution. +1
Sparr
+2  A: 

You seem to have a well-defined set of rules regarding what you consider to be valid input - you might consider using a hand-written LL parser for this. Such parsers are relatively easy to write and optimize. Usually you'd have the parser output a tree structure describing the input - I would use this tree as input to a matching routine that performs the work of matching the tree against the list of entries, using the rules you described above.

Here's an article on recursive descent parsers.

Erik Forbes
"you might consider using a hand-written LL parser for this" - don't you think that might be overkill?
Mitch Wheat
Not if he's dead-set against Regexes.
Erik Forbes
Besides - a regex engine is just a DFA - building your own DFA to consider *only* the rules you want can be faster, provided you know what you're doing.
Erik Forbes
@Erik - yes I realise that.
Mitch Wheat
Then why did you ask? Lol
Erik Forbes
A: 

Have a look at RegExLib

Mitch Wheat
+3  A: 

I would use a tree structure to store the rules, where each tree node is/contains a Dictionary.

Construct the tree such that "com", "net", etc are the top level entries, "example" is in the next level, and so on. You'll want a special flag to note that the node is a wildcard.

To perform the lookup, split the string by period, and iterate backwards, navigating the tree based on the input.

This seems similar to what you say you considered, but assuming the rules don't change each run, using a cached Dictionary-based tree would be faster than a list of arrays.

Additionally, I would have to bet that this approach would be faster than RegEx.
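A minimal sketch of this tree, in Java for consistency with the other code on this page (the C# `Dictionary` maps directly to `HashMap`); class and field names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class DomainTree {

    static class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean wildcard;  // a "*." rule ends at this node
        boolean terminal;  // an exact rule ends at this node
    }

    private final Node root = new Node();

    // "*.example.com" sets the wildcard flag on the "example" node under "com"
    public void add(String rule) {
        boolean wild = rule.startsWith("*.");
        String[] labels = (wild ? rule.substring(2) : rule).split("\\.");
        Node node = root;
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(labels[i], k -> new Node());
        }
        if (wild) node.wildcard = true; else node.terminal = true;
    }

    public boolean matches(String domain) {
        String[] labels = domain.split("\\.");
        Node node = root;
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.get(labels[i]);
            if (node == null) return false;
            if (node.wildcard) return true;  // matches this level and deeper
        }
        return node.terminal || node.wildcard;
    }
}
```

Each lookup costs one hash probe per label of the input, independent of the number of rules.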

NilObject
I was about to say something almost exactly like this, but I scrolled down and there it was! +1
Greg Hewgill
A: 

Not sure what your ideas were for splitting and iterating, but it seems like it wouldn't be slow:

Split the domains up and reverse, like you said. Storage could essentially be a tree. Use a hashtable to store the TLDs. The key would be, for example, "com", and the values would be a hashtable of subdomains under that TLD, iterated ad nauseam.

Ed Marty
A: 

Given your requirements, I think you're on track in thinking about working from the end of the string (TLD) towards the hostname. You could use regular expressions, but since you're not really using any of the power of a regexp, I don't see why you'd want to incur their cost. If you reverse the strings, it becomes more apparent that you're really just looking for prefix matching ('*.example.com' becomes: "does my reversed input string begin with 'moc.elpmaxe'?"), which certainly doesn't require something as heavy-handed as regexps.
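A sketch of that reversal idea under the stated rules (Java, with illustrative names); in practice `endsWith` on the un-reversed string is equivalent, but the reversed form makes it clear why a prefix structure such as a trie applies:

```java
public class ReversedPrefixMatch {

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // "*.example.com" reduces to: does the reversed candidate equal
    // reverse("example.com"), or start with it at a label boundary?
    // The boundary check keeps "badexample.com" from matching.
    public static boolean wildcardMatches(String candidate, String rule) {
        String suffix = rule.substring(2);  // drop the "*."
        String rc = reverse(candidate);
        String rs = reverse(suffix);
        return rc.equals(rs) || rc.startsWith(rs + ".");
    }
}
```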

What structure you use to store your list of entries depends a lot on how big the list is and how often it changes... for a huge stable list, a tree/trie may be the most performant; an often-changing list needs a structure that is easy to initialize/update, and so on. Without more information, I'd be reluctant to suggest any one structure.

JaredReisinger
+1  A: 

Assuming the rules are as you said: literal, or starting with "*.".

Java:

public static boolean matches(String candidate, List<String> rules) {
    for (String rule : rules) {
        if (rule.startsWith("*.")) {
            String suffix = rule.substring(2);
            // "*.example.com" matches example.com itself and any subdomain;
            // checking the label boundary avoids matching badexample.com
            if (candidate.equals(suffix) || candidate.endsWith("." + suffix)) {
                return true;
            }
        } else if (candidate.equals(rule)) {
            return true;
        }
    }
    return false;
}

This scales linearly with the number of rules you have.

EDIT:

Just to be clear here: if you have a LOT (LOT LOT) of rules, I would sort the rules.

When I say "sort the rules", I really mean create a tree out of the rule characters. Then you use the match string to try and walk the tree (i.e. if I have a string of xyz, I start with the x character, see if it has a y branch, and then a z child).

For non-wildcard matches, you iterate each character to narrow the possible rules (i.e. if it starts with "w", then you work with the "w" rules, etc.).

For the wildcard matches, I'd use the same concept, but populate the tree "backwards" and walk it with the back of the match candidate, simply matching from the end of the string against the end of the rule.
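That character-tree walk might be sketched like this (Java, illustrative names); rules are stored reversed so the walk starts from the end of the candidate, which handles both the exact and the "*." cases in one pass:

```java
import java.util.HashMap;
import java.util.Map;

public class CharTrie {

    static class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean wildcard;  // a reversed "*." rule ends here
        boolean exact;     // a reversed exact rule ends here
    }

    private final Node root = new Node();

    // Store each rule reversed, so matching walks from the end of the candidate.
    public void add(String rule) {
        boolean wild = rule.startsWith("*.");
        String body = wild ? rule.substring(2) : rule;
        Node node = root;
        for (int i = body.length() - 1; i >= 0; i--) {
            node = node.next.computeIfAbsent(body.charAt(i), k -> new Node());
        }
        if (wild) node.wildcard = true; else node.exact = true;
    }

    public boolean matches(String candidate) {
        Node node = root;
        for (int i = candidate.length() - 1; i >= 0; i--) {
            // a wildcard rule matches if we hit a label boundary here
            if (node.wildcard && candidate.charAt(i) == '.') return true;
            node = node.next.get(candidate.charAt(i));
            if (node == null) return false;
        }
        return node.exact || node.wildcard;
    }
}
```

The label-boundary check ("is the next character a dot?") is what keeps `badexample.com` from matching `*.example.com`.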

Will Hartung
+9  A: 
e.James
This is a great way to do it, assuming that a "*" can only appear at the end (which seems reasonable here). You might get a speedup if you flatten the trie to an array of strings at some depth (e.g. 8 chars) since at this depth the trie is quite sparse and following per-char pointers is slow.
j_random_hacker
Good suggestion. The pointer following overhead will slow things down a touch. I wonder what the optimal array depth is? As usual, we're going to need some profiling to see if our solutions are actually helping :)
e.James
This is the biggest memory hog out there (I've actually implemented this once). With virtual memory paging, I'm not too sure about performance either.
Vardhan Varma
There is no doubt that it will take up a lot of memory. The simplest approach would use around 50 to 100 times the memory of the original list. I imagine you'd be looking at about 15-20 MB for the first thousand entries.
e.James
As with any design decision, the memory cost will have to be weighed against the performance improvement, and both will have to be measured. I described the fastest method I could think of here, because the OP was asking about performance.
e.James
A: 

I guess I am tempted to answer your question with another one: what are you doing that makes you believe your bottleneck is string matching, above and beyond a simple string compare? Surely something else ranks higher in your performance profile?

I would run the obvious string-compare tests first, which will be right 90% of the time, and fall back to a regex if they fail.

Scott Evernden
Our string comparison tests are nowhere near 90% on first-pass.
jro
+5  A: 

You don't need a regexp. Just reverse all the strings, strip the '*', and set a flag on each entry indicating that a partial match up to that point passes.

Somehow, a trie or suffix trie looks most appropriate.

If the list of domains is known at compile time, you may look at tokenizing at '.' and using multiple gperf-generated machines.

Links: google for "trie"; suffix trees: http://marknelson.us/1996/08/01/suffix-trees/

Vardhan Varma
+1  A: 

I'd try a combination of tries with longest-prefix matching (which is used in routing for IP networking). Directed Acyclic Word Graphs may be more appropriate than tries if space is a concern.

Hank Gay
A: 

I'm going to suggest an alternative to the tree structure approach. Create a compressed index of your domain list using a Burrows-Wheeler transform. See http://www.ddj.com/architect/184405504?pgno=1 for a full explanation of the technique.

codeelegance
A: 

If it were just matching literal strings, you should look at trie data structures and algorithms. An earlier answer suggests that, if all your wildcards are a single wildcard at the beginning, there are some specific algorithms you can use. However, a requirement to handle general wildcards means that, for fast execution, you're going to need to generate a state machine.

That's what a regex library does for you: "precompiling" the regex == generating the state machine; this allows the actual match at runtime to be fast. You're unlikely to get significantly better performance than that without extraordinary optimization efforts.

If you want to roll your own, I can say that writing your own state machine generator specifically for multiple wildcards should be educational. In that case, you'll need to read up on the kind of algorithms they use in regex libraries...

A: 

Investigate the KMP (Knuth-Morris-Pratt) or BM (Boyer-Moore) algorithms. These let you search the string faster than a naive character-by-character scan (Boyer-Moore can skip sections of the text entirely), at the cost of a little pre-processing. Dropping the leading asterisk is of course crucial, as others have noted.

One source of information for these is:

KMP: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html

BM: http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

Jonathan Leffler