views:

139

answers:

2

I'm looking for a bit of code that will:

Given regular expression E, derive the longest string X
such that for every S, X is a substring of S iff S will match E

examples:

E = "a", X = "a"
E = "^a$", X = "a"
E = "a(b|c)", X = "a"
E = "[ab]", X = ""

context: I want to match some regular expressions against a data store that only supports substring searching. It would be nice to optimize the regular expression searching by applying a substring search to the data store to reduce the amount of data transferred as much as possible.

example 2:

If I want to catch "error foo", "error bar", "error baz", I might specify

error: (foo|bar|baz)

and send

search "error: "

to the data store, and then regexping the returned items.

Thanks!

+1  A: 

In general terms, you could try to split the regex at all non-unique ((a|b), [ab]) matches, and then look for the longest string in the resulting array. Something like

$foo = longest(regex_split($regex, '(\(.*?\|.*?\))|(\[.*?\])'));
l0b0
+1  A: 

Maybe convert RE to a finite state automata and look for the longest part that needs to be present in a path between start and finish states... Geometric thinking with a graph can be easier to you, at least it is in my case.

Tadeusz A. Kadłubowski