views:

106

answers:

4

I have several regexes (actually several thousands), and I must check if one string matches any of these regexes. It is not very efficient, so I would like to merge all these regexes as a single regex.

For example, if a have these regexes:

  • 'foo *bar'
  • 'foo *zip'
  • 'zap *bar'

I would like to obtain something like 'foo *(bar|zip)|zap *bar'.

Is there some algorithm, library or tool to do this?

A: 

I can't imagine, even if possible, that the resultant regex would be any more efficient.

hometoast
I disagree; a regex search for "foo (?:bar|baz)" is going to be quicker than a search for "foo bar" and a search for "foo baz", as searching separately would require matching (or not) the "foo" part twice.
me_and
-1 The way in which the automaton is built will automatically optimize many cases. On top of that, you can optimize the resulting state machine further (see Vlad's answer).
Aaron Digulla
me ~= corrected. Thanks!
hometoast
+7  A: 

You can just concatenate the regexes using or (|) (and anchors for the beginning/end of string).

Most good regex libraries optimize their finite state automata after they build it from your regex. PCRE does that, for instance.

This step usually takes care of your optimization problem, ie. they apply most of the transformations you would have to do "by hand".

Vlad Valceanu
A: 

I very much doubt it, on the grounds that any such tool would have to be very complex to deal with all the different ways in which a regex could be combined.

If the regexes you have are relatively simple, such as in your examples, you may have some luck writing your own, however.

me_and
+1  A: 

In theory a regex is a (nondeterministic)finite-state automata; thus they can be merged and minimized. You can take a look at this as a starting point.

Beware, though, that this might not be the most correct answer. Why do you have to deal with several thousands regular expressions? I can only fathom the maintentance hell of such a thing. Perhaps you should consider writing a parser and a grammar -- much easily done (and grammars are more powerful than regexps anyways).

lorenzog