views:

149

answers:

2

Hello.

I have n regexes. On input a string i have to find all the regexes that this string matches against. This is normally an O(n) operation:

SELECT regex FROM regexes WHERE 'string' RLIKE regex

Now I wonder, what's the fastest way to do this ? Are there database, structures or storage systems that are optimized to do such an operation ?

+3  A: 

The short answer is 'No.' There is no index structure currently available on any DBMS platform that will index partial matches of a regex like this.

The long answer is that a leading constant on a wildcard match (e.g. 'foo_') can be used as a prefix for index matches. Many DBMS platforms will optimise this and use an index (if available) to resolve the prefix. However, this is not anything like as clever as a full regex, and the indexing can only be used if you have a constant prefix.

The even longer answer is that there are algorithms such as RETE that will optimise partial matches like this. This might be applicable if you can express your matches as forward-chaining production rules rather than regular expressions.

Rete works by computing partial matches and only presenting rules that can be reached from this partial match, so it is more efficient than O(n) (more like O(log n) but I'm not sure of the exact time complexity) for matching n rules against a fact.

ConcernedOfTunbridgeWells
+1  A: 

One optimisation you could implement, if applicable to your case, would be to categorise your Regexes and organise them in hierarchies so that:

  • you only have to test a handful of most-general Regexes.

  • for any general regex that matches, then proceed to test the string against all regexes of the same category only.

For instance, if your input strings can be anything arbitrarily complex and you have thousands of regexes, you could organise them in categories like:

  • the \d+ category, which would test number patterns (SSN, telephone numbers etc)

  • the <.*?> category, which would test the presence of HTML tags

  • the \w+@\w+ category, which could test the presence of emails addresses

etc.

If any root pattern doesn't match, then you avoid having to test whole ranges of patterns that would fail anyway.

Don't know if that would match your exact domain, but it's a possible optimisation.

Renaud Bompuis