views: 1442
answers: 14

I'm trying to build an efficient string matching algorithm. This will execute in a high-volume environment, so performance is critical.

Here are my requirements:

  • Given a domain name, i.e. www.example.com, determine if it "matches" one in a list of entries.
  • Entries may be absolute matches, i.e. www.example.com.
  • Entries may include wildcards, i.e. *.example.com.
  • Wildcard entries match from the most-defined level and up. For example, *.example.com would match www.example.com, example.com, and sub.www.example.com.
  • Wildcard entries are not embedded, i.e. sub.*.example.com will not be an entry.

Language/environment: C# (.Net Framework 3.5)

I've considered splitting the entries (and domain lookup) into arrays, reversing the order, then iterating through the arrays. While accurate, it feels slow.

I've considered Regex, but am concerned about accurately representing the list of entries as regular expressions.

My question: what's an efficient way of finding if a string, in the form of a domain name, matches any one in a list of strings, given the description listed above?

+5  A: 

I would use Regex; just make sure the expression is compiled once (instead of being recompiled on every lookup).
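As a rough sketch of that idea in Java (the question targets C#, where `RegexOptions.Compiled` plays the same role as compiling the `Pattern` once up front); the class and method names here are illustrative, not from any existing library:

```java
import java.util.regex.Pattern;

public class CompiledRuleMatcher {

    // Translate one rule into a regex. Per the question's rules,
    // "*.example.com" matches example.com itself and any chain of
    // subdomain labels in front of it.
    public static Pattern compileRule(String rule) {
        if (rule.startsWith("*.")) {
            String suffix = Pattern.quote(rule.substring(2));
            // zero or more dotted labels, then the literal suffix
            return Pattern.compile("([^.]+\\.)*" + suffix);
        }
        return Pattern.compile(Pattern.quote(rule));
    }

    // Convenience wrapper for demonstration only; in production you would
    // compile each rule once and cache the Pattern, not recompile per call.
    public static boolean matchesRule(String candidate, String rule) {
        return compileRule(rule).matcher(candidate).matches();
    }
}
```

The `([^.]+\.)*` prefix only matches whole labels, so `badexample.com` does not match `*.example.com`.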

eglasius
Care to elaborate?
jro
I think that regular expressions, even with precompilation, would be very much overkill and inefficient for matching of this simple nature. Then again, I think most naive approaches would also be very inefficient, so maybe this is the best easy solution. +1
Sparr
+2  A: 

You seem to have a well-defined set of rules regarding what you consider to be valid input - you might consider using a hand-written LL parser for this. Such parsers are relatively easy to write and optimize. Usually you'd have the parser output a tree structure describing the input - I would use this tree as input to a matching routine that performs the work of matching the tree against the list of entries, using the rules you described above.

Here's an article on recursive descent parsers.

Erik Forbes
"you might consider using a hand-written LL parser for this" - don't you think that might be overkill?
Mitch Wheat
Not if he's dead-set against Regexes.
Erik Forbes
Besides - a regex engine is just a DFA - building your own DFA to consider *only* the rules you want can be faster, provided you know what you're doing.
Erik Forbes
@Erik - yes I realise that.
Mitch Wheat
Then why did you ask? Lol
Erik Forbes
A: 

Have a look at RegExLib

Mitch Wheat
+3  A: 

I would use a tree structure to store the rules, where each tree node is/contains a Dictionary.

Construct the tree such that "com", "net", etc are the top level entries, "example" is in the next level, and so on. You'll want a special flag to note that the node is a wildcard.

To perform the lookup, split the string by period, and iterate backwards, navigating the tree based on the input.

This seems similar to what you say you considered, but assuming the rules don't change each run, using a cached Dictionary-based tree would be faster than a list of arrays.

Additionally, I would have to bet that this approach would be faster than RegEx.
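A minimal sketch of this tree, in Java for consistency with the other code on this page (the C# `Dictionary` maps directly to `HashMap`); class and field names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class DomainTree {

    static class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean wildcard;  // a "*." rule ends at this node
        boolean terminal;  // an exact rule ends at this node
    }

    private final Node root = new Node();

    // "*.example.com" sets the wildcard flag on the "example" node under "com"
    public void add(String rule) {
        boolean wild = rule.startsWith("*.");
        String[] labels = (wild ? rule.substring(2) : rule).split("\\.");
        Node node = root;
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(labels[i], k -> new Node());
        }
        if (wild) node.wildcard = true; else node.terminal = true;
    }

    public boolean matches(String domain) {
        String[] labels = domain.split("\\.");
        Node node = root;
        for (int i = labels.length - 1; i >= 0; i--) {
            node = node.children.get(labels[i]);
            if (node == null) return false;
            if (node.wildcard) return true;  // matches this level and deeper
        }
        return node.terminal || node.wildcard;
    }
}
```

Each lookup costs one hash probe per label of the input, independent of the number of rules.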

NilObject
I was about to say something almost exactly like this, but I scrolled down and there it was! +1
Greg Hewgill
A: 

Not sure what your ideas were for splitting and iterating, but it seems like it wouldn't be slow:

Split the domains up and reverse, like you said. Storage could essentially be a tree. Use a hashtable to store the TLDs. The key would be, for example, "com", and the values would be a hashtable of subdomains under that TLD, iterated ad nauseam.

Ed Marty
A: 

Given your requirements, I think you're on track in thinking about working from the end of the string (TLD) towards the hostname. You could use regular expressions, but since you're not really using any of the power of a regexp, I don't see why you'd want to incur their cost. If you reverse the strings, it becomes more apparent that you're really just looking for prefix matching ('*.example.com' becomes: "does my reversed input string begin with 'moc.elpmaxe'?"), which certainly doesn't require something as heavy-handed as regexps.
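A sketch of that reversal idea under the stated rules (Java, with illustrative names); in practice `endsWith` on the un-reversed string is equivalent, but the reversed form makes it clear why a prefix structure such as a trie applies:

```java
public class ReversedPrefixMatch {

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // "*.example.com" reduces to: does the reversed candidate equal
    // reverse("example.com"), or start with it at a label boundary?
    // The boundary check keeps "badexample.com" from matching.
    public static boolean wildcardMatches(String candidate, String rule) {
        String suffix = rule.substring(2);  // drop the "*."
        String rc = reverse(candidate);
        String rs = reverse(suffix);
        return rc.equals(rs) || rc.startsWith(rs + ".");
    }
}
```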

What structure you use to store your list of entries depends a lot on how big the list is and how often it changes... for a huge stable list, a tree/trie may be the most performant; an often-changing list needs a structure that is easy to initialize/update, and so on. Without more information, I'd be reluctant to suggest any one structure.

JaredReisinger
+1  A: 

Assuming the rules are as you said: literal, or starting with "*.".

Java:

public static boolean matches(String candidate, List<String> rules) {
    for (String rule : rules) {
        if (rule.startsWith("*.")) {
            String suffix = rule.substring(2);
            // "*.example.com" matches example.com itself and any subdomain;
            // checking the label boundary avoids matching badexample.com
            if (candidate.equals(suffix) || candidate.endsWith("." + suffix)) {
                return true;
            }
        } else if (candidate.equals(rule)) {
            return true;
        }
    }
    return false;
}

This scales linearly with the number of rules you have.

EDIT:

Just to be clear here: if you have a LOT (LOT LOT) of rules, I would sort the rules.

When I say "sort the rules", I really mean create a tree out of the rule characters. Then you use the match string to try and walk the tree (i.e. if I have a string of xyz, I start with the x character, see if it has a y branch, and then a z child).

For non-wildcard matches, you iterate each character to narrow the possible rules (i.e. if it starts with "w", then you work with the "w" rules, etc.).

For the wildcard matches, I'd use the same concept, but populate the tree "backwards" and walk it with the back of the match candidate, simply matching from the end of the string against the end of the rule.
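That character-tree walk might be sketched like this (Java, illustrative names); rules are stored reversed so the walk starts from the end of the candidate, which handles both the exact and the "*." cases in one pass:

```java
import java.util.HashMap;
import java.util.Map;

public class CharTrie {

    static class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean wildcard;  // a reversed "*." rule ends here
        boolean exact;     // a reversed exact rule ends here
    }

    private final Node root = new Node();

    // Store each rule reversed, so matching walks from the end of the candidate.
    public void add(String rule) {
        boolean wild = rule.startsWith("*.");
        String body = wild ? rule.substring(2) : rule;
        Node node = root;
        for (int i = body.length() - 1; i >= 0; i--) {
            node = node.next.computeIfAbsent(body.charAt(i), k -> new Node());
        }
        if (wild) node.wildcard = true; else node.exact = true;
    }

    public boolean matches(String candidate) {
        Node node = root;
        for (int i = candidate.length() - 1; i >= 0; i--) {
            // a wildcard rule matches if we hit a label boundary here
            if (node.wildcard && candidate.charAt(i) == '.') return true;
            node = node.next.get(candidate.charAt(i));
            if (node == null) return false;
        }
        return node.exact || node.wildcard;
    }
}
```

The label-boundary check ("is the next character a dot?") is what keeps `badexample.com` from matching `*.example.com`.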

Will Hartung
+9  A: 
e.James
This is a great way to do it, assuming that a "*" can only appear at the end (which seems reasonable here). You might get a speedup if you flatten the trie to an array of strings at some depth (e.g. 8 chars) since at this depth the trie is quite sparse and following per-char pointers is slow.
j_random_hacker
Good suggestion. The pointer following overhead will slow things down a touch. I wonder what the optimal array depth is? As usual, we're going to need some profiling to see if our solutions are actually helping :)
e.James
This is the biggest memory hog out there (I've actually implemented this once). With virtual memory paging, I'm not too sure about performance either.
Vardhan Varma
There is no doubt that it will take up a lot of memory. The simplest approach would use around 50 to 100 times the memory of the original list. I imagine you'd be looking at about 15-20 MB for the first thousand entries.
e.James
As with any design decision, the memory cost will have to be weighed against the performance improvement, and both will have to be measured. I described the fastest method I could think of here, because the OP was asking about performance.
e.James
A: 

I guess I am tempted to answer your question with another one: what are you doing that makes you believe your bottleneck is string matching, above and beyond a simple string compare? Surely something else ranks higher in your performance profile?

I would run the obvious string-compare tests first, which will be right 90% of the time, and fall back to a regex if they fail.

Scott Evernden
Our string comparison tests are nowhere near 90% on first-pass.
jro
+5  A: 

You don't need a regexp. Just reverse all the strings, strip the '*', and set a flag on each entry indicating that a partial match up to that point passes.

Somehow, a trie or suffix trie looks most appropriate.

If the list of domains is known at compile time, you may look at tokenizing at '.' and using multiple gperf-generated machines.

Links: google for "trie"; suffix trees: http://marknelson.us/1996/08/01/suffix-trees/

Vardhan Varma
+1  A: 

I'd try a combination of tries with longest-prefix matching (which is used in routing for IP networking). Directed Acyclic Word Graphs may be more appropriate than tries if space is a concern.

Hank Gay
A: 

I'm going to suggest an alternative to the tree structure approach. Create a compressed index of your domain list using a Burrows-Wheeler transform. See http://www.ddj.com/architect/184405504?pgno=1 for a full explanation of the technique.

codeelegance
A: 

If it were just matching literal strings, you should look at trie data structures and algorithms. An earlier answer suggests that, if all your wildcards are a single wildcard at the beginning, there are some specific algorithms you can use. However, a requirement to handle general wildcards means that, for fast execution, you're going to need to generate a state machine.

That's what a regex library does for you: "precompiling" the regex == generating the state machine; this allows the actual match at runtime to be fast. You're unlikely to get significantly better performance than that without extraordinary optimization efforts.

If you want to roll your own, I can say that writing your own state machine generator specifically for multiple wildcards should be educational. In that case, you'll need to read up on the kind of algorithms they use in regex libraries...

A: 

Investigate the KMP (Knuth-Morris-Pratt) or BM (Boyer-Moore) algorithms. These let you search the string faster than a naive character-by-character scan (Boyer-Moore can skip sections of the text entirely), at the cost of a little pre-processing. Dropping the leading asterisk is of course crucial, as others have noted.

One source of information for these is:

KMP: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html

BM: http://www-igm.univ-mlv.fr/~lecroq/string/node14.html

Jonathan Leffler