tags:

views:

206

answers:

5

What is the most efficient wildcard string matching algorithm? I am asking only about the idea; it is not necessary to provide actual code.

I'm thinking that such an algorithm could be built with sorted suffix arrays, which can yield O(log(n)) performance.
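For reference, the suffix-array idea looks like this as a minimal Python sketch (the names `suffix_array` and `contains` are mine, and the naive O(n² log n) construction is only for clarity — linear-time constructions exist). Note that binary search over the suffixes finds one contiguous fragment in O(m log n), not a full wildcard pattern:

```python
import bisect

def suffix_array(text):
    # Sorted list of all suffixes; naive construction for clarity only.
    return sorted(text[i:] for i in range(len(text)))

def contains(sa, fragment):
    # Binary search for any suffix starting with the fragment:
    # O(m log n) comparisons, where m = len(fragment).
    i = bisect.bisect_left(sa, fragment)
    return i < len(sa) and sa[i].startswith(fragment)

sa = suffix_array("mississippi")
```

With `sa` built once, `contains(sa, "sip")` is true while `contains(sa, "spi")` is false; handling patterns with several `*`-separated fragments would still require combining such lookups.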

Am I correct?

Edited:

I mean patterns like "A*B", "*sip*" or "A?B", where the star means any number of symbols and the question mark means a single symbol.

+1  A: 

Hm, I think that normal pattern matching rules would apply here. Usually, since you have a stream of data and short patterns, you would not need to implement anything more efficient than a linear scan. However, the longer the pattern gets, the more room there is for optimization.

What kind of wildcard do you have in mind? A one-character wildcard (e.g. . in regex), or a multiple-character wildcard (e.g. .*)? Are there limitations? What is the expected pattern length, and do you have random or serial access to the data to be checked?

Lucero
A: 

The performance will depend not just on the length of the string being searched but also on the number (and type) of wildcards in the query string. If a * is allowed to match any number of characters, up to and including the entire document, and any number of stars may appear, this puts limits on what is achievable.

If you can find a match for some string foo in O(f(n)) time, then the query foo_0*foo_1*foo_2*...*foo_m will take O(m*f(n)) time, where m is the number of * wildcards.
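That decomposition can be sketched in Python (the function name is mine): split the pattern on '*', anchor the first and last segments at the ends of the text, and locate each middle segment left to right with an ordinary substring search:

```python
def match_star(text, pattern):
    # Pattern may contain '*' (matching any run of characters).
    segments = pattern.split("*")
    if len(segments) == 1:            # no wildcard: exact match
        return text == pattern
    if not text.startswith(segments[0]):
        return False
    pos = len(segments[0])
    for seg in segments[1:-1]:
        i = text.find(seg, pos)       # one O(f(n)) substring search
        if i == -1:
            return False
        pos = i + len(seg)
    last = segments[-1]
    # The final segment must fit after everything already consumed.
    return text.endswith(last) and len(text) - len(last) >= pos
```

Greedy left-to-right placement is safe for star-only patterns, because placing each middle segment as early as possible never rules out a later match; supporting '?' as well needs an automaton- or DP-based matcher.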

Mark Byers
A: 

Depending on the wildcard "language", I'd (probably) simply write a wildcard-to-regexp compiler and use the (commonly provided) regexp engine to do the actual matching. It's a bit lazy, but unless doing it that way causes an actual performance problem, it'd be fast enough.
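In Python, for instance, such a compiler is only a few lines (the standard library's fnmatch.translate does essentially the same job; the function name below is mine, and re.DOTALL is a choice that lets '*' span newlines):

```python
import re

def wildcard_to_regex(pattern):
    # Escape regex metacharacters, then map the wildcards:
    # '*' -> '.*' (any run of characters), '?' -> '.' (one character).
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)
```

For example, `wildcard_to_regex("A*B")` matches "AxxB", while `wildcard_to_regex("A?B")` does not.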

Vatine
A: 

You could convert your wildcard query into a regular expression and use that for matching; a regular expression can always be transformed into a DFA (deterministic finite automaton), which runs in linear time with a small constant.

Ritsaert Hornstra
But the intention is not to re-use existing functionality; it is to design an algorithm for this particular purpose.
Vitaliy Liptchinsky
A: 

There is a paper covering the fastest options here: http://swtch.com/~rsc/regexp/regexp1.html. In particular, it shows how to avoid naive algorithms that become pathologically slow on long patterns.

It covers generic regular expressions but you can limit your implementation to the subset you require.

martinwguy