views:

415

answers:

1

Anyone know how one might adapt a search tree to handle limited regular expressions? The task is, given a file name, find all nodes matching that file name. Nodes may contain usual file name globs (* and ?). Obviously, since this is a search tree, speed is of the essence.

EDIT: I should add that the most important case for speed is the average time to rule out a match. That is, in most cases matching will fail.

An example: Say the tree contained the following nodes:

foo, bar, foo*, *bar, foo?bar

Searching for foo would return nodes 1 and 3. Searching for bar would return nodes 2 and 4. Searching for fob would return no nodes. Searching for fooxbar would return node 5. Searching for foobar would return nodes 3 and 4.

+8  A: 

An aho-corasick search tree would fit the bill. Aho-Corasick a very good article about this sort of thing Tries, and the implementation used in Evolution to replace regex searching Etrie

Edit: To do the whole string matching, you can add beginning and ending anchor states, if scanning multi line data, you could add the newline to begin and end. You could also remove the part where it add the cross linking for the partial matching starting a different match, this also allows faster exclusion.

Another algorithm for checking for membership in a string set is CritBit. This doesn't have Regex, but it simple and test complete strings.

sfossen
That looks very promising, although I want to match the whole input string, not substrings within it. I'll read the links and confirm it fits the bill.
Kris Braun
You can add a new front of line anchor, or if scanning multi line haystacks and add the line ending to the front of the needle. eg "\nsearch string".
sfossen