views:

149

answers:

3

I have two algorithmic questions for a project I am working on. I have thought about these, and have some suspicions, but I would love to hear the community's input as well.

  1. Suppose I have a string, and a list of N regular expressions (actually they are wildcard patterns representing a subset of full regex functionality). I want to know whether the string matches at least one of the regular expressions in the list. Is there a data structure that can allow me to match the string against the list of regular expressions in sublinear (presumably logarithmic) time?

  2. This is an extension of the previous problem. Suppose I have the same situation: a string and a list of N regular expressions, only now each of the regular expressions is paired with an offset within the string at which the match must begin (or, if you prefer, each of the regular expressions must match a substring of the given string beginning at the given offset).

To give an example, suppose I had the string:

This is a test string

and the regex patterns and offsets:

(a) his.*  at offset 0
(b) his.*  at offset 1

The algorithm should return true. Although regex (a) does not match the string beginning at offset 0, regex (b) does match the substring beginning at offset 1 ("his is a test string").

Is there a data structure that can allow me to solve this problem in sublinear time?

One possibly useful piece of information is that often, many of the offsets in the list of regular expressions are the same (i.e. often we are matching the substring at offset X many times). This may be useful to leverage the solution to problem #1 above.

Thank you very much in advance for any suggestions you may have!

A: 

(1) Combine all the regular expressions as a big union: (r1)|(r2)|(r3)|...

(2) For each regex with offset n add n dots to the beginning plus an anchor. So his.* at offset 6 becomes ^......his.*. Or if your regex syntax supports it, ^.{6}his.*.

John Kugelman
+2  A: 

I will assume that you really have the power of regular expressions.

  1. To determine whether a string is matched by one of expressions e_1, e_2, ..., e_n, just match against the expression e_1 + e_2 + ... + e_n (sometimes the + operator is written as |).

  2. Given expression-offset pairs (e_1, o_1), ..., (e_n, o_n) and a string, you can check whether there is i such that the string is matched by expression e_i at offset o_i by matching against the expression .{o_1}e_1 + ... + .{o_n}e_n.

Depending on the form of the individual expressions, you can get sublinear performance (not in general though).

avakar
I'll respond here since this answer is top-voted, although Christian Semrau and John Kugelman's answers below are basically the same. First of all, thank you all very much for your replies! I appreciate the help.I'd be surprised if this solution were practical ... I was under the impression that compiling a DFA from a regular expression took O(2^m) time and space for a regex of length m, and with N ~ tens of thousands of regexs to match and offsets as large as a few million, this would seem unmanageable. The length of the union of regexs would be just too huge. Am I mistaken?
DSII
The bound of O(2^m) is a worst case upper bound, as given by the [powerset construction](http://en.wikipedia.org/wiki/Powerset_construction). After a few applications of that construction to wildcard patterns (containing only "*" and "?" as wildcards, meaning ".*" and ".?" as regex), it seems to me that a DFA has a similar number of states as the NFA, which has a number of states similar to the length of the pattern. Of course, the union of N such DFAs may very well expode when converted to a DFA. And start offsets are not even considered. So, you are right: The automaton approach fails here.
Christian Semrau
Thank you again for the reply. I think you are right. Is there any other alternative to naive linear search then?
DSII
A: 

If your expressions are sufficiently simple (wildcard patterns are), AND your set of expressions are predetermined, i.e. only the input to be matched changes, THEN you may construct a finite state machine that matches the union of your expressions, i.e., the expression "(r1)|(r2)|...".

Constructing that machine takes time and space at least O(N) (but I guess it is not exponential, which is the worst case for regular expressions in general). Matching is then O(length(input)), independent of N.

OTOH, if your set of expressions is to be part of the program's input, then there is no sublinear algorithm, simply because each expression must be considered.

Christian Semrau