tags:
views: 70
answers: 3

I'm writing a program in .NET where the user may provide a large number of regular expressions. For a given string, I need to determine which regular expression matches it (if more than one matches, I only need the first one that matches). However, with a large number of regular expressions this operation can take a very long time.

I was somewhat hoping there would be something similar to flex (the Fast Lexical Analyzer, not Adobe Flex) for .NET that would let me specify a large number of regular expressions yet quickly (O(n) according to Wikipedia, for n = the length of the input string) figure out which regular expression matches.

Also, I would prefer not to implement my own regular expression engine :).

+1  A: 

What? Even testing if a single regular expression matches cannot be done in O(n) time in general. Where did you get this information from? What feature is this in Flex? I am sure it must be some limited form of regular expressions, not for arbitrary .NET regular expressions.

To handle arbitrary regular expressions, the simple way is to store your regular expressions in a List and iterate over them one by one until you find one that matches.
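A minimal sketch of that approach (the class and method names here are mine, not from any library):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class FirstMatchDemo
{
    // Returns the index of the first pattern that matches, or -1 if none do.
    static int IndexOfFirstMatch(IList<Regex> patterns, string input)
    {
        for (int i = 0; i < patterns.Count; i++)
        {
            if (patterns[i].IsMatch(input))
                return i;
        }
        return -1;
    }

    static void Main()
    {
        var patterns = new List<Regex>
        {
            new Regex(@"^\d+$"),
            new Regex(@"^[a-z]+$"),
        };
        Console.WriteLine(IndexOfFirstMatch(patterns, "hello")); // prints 1
        Console.WriteLine(IndexOfFirstMatch(patterns, "!!!"));   // prints -1
    }
}
```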

Mark Byers
Testing a single regular expression can be done in O(n) time. Compiling the regular expression is the part that takes a long time (see http://en.wikipedia.org/wiki/Regex). From the article, "Constructing the DFA for a regular expression of size m has the time and memory cost of O(2^m), but it can be run on a string of size n in time O(n)."
Jack
@Jack That page talks about regular expressions in the theoretical sense (the kind matching regular languages), not practical (Perl-like) regexes. Wikipedia says: "For example, any implementation which allows the use of backreferences, or implements the various extensions introduced by Perl, must use a backtracking implementation."
Christian Semrau
@Jack: First: O(2^m) gets very big very quickly; is this actually practical for real arbitrary .NET regular expressions? Second: compiling is part of the operation of taking a user-entered regular expression and checking it against a string. Are you proposing to ignore the compilation cost when measuring the runtime? Why?
Mark Byers
The reason I believe I can ignore it is because I can compile the regular expressions "off-line" before attempting to match them.
Jack
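In .NET terms, that "off-line" compilation can look like constructing the Regex objects once, up front, with RegexOptions.Compiled, so the construction cost is paid before any matching happens (a sketch; the field names are mine):

```csharp
using System;
using System.Text.RegularExpressions;

class PrecompiledPatterns
{
    // Built once; RegexOptions.Compiled emits MSIL for the matcher,
    // trading slower one-time construction for faster repeated matching.
    static readonly Regex Number = new Regex(@"^\d+$", RegexOptions.Compiled);
    static readonly Regex Word   = new Regex(@"^[a-z]+$", RegexOptions.Compiled);

    static void Main()
    {
        Console.WriteLine(Number.IsMatch("12345")); // prints True
        Console.WriteLine(Word.IsMatch("12345"));   // prints False
    }
}
```

Note this only moves the compilation cost; .NET's backtracking engine still gives no O(n) matching guarantee in general.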
re: iterating over a List, I'd point out 'yield'
annakata
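The 'yield' suggestion presumably means streaming the matches lazily, so enumeration stops as soon as the caller has seen enough. A sketch (the helper name is mine):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class LazyMatching
{
    // Lazily yields each matching pattern; the loop body only runs as the
    // caller pulls, so FirstOrDefault() tests no more patterns than needed.
    static IEnumerable<Regex> Matching(IEnumerable<Regex> patterns, string input)
    {
        foreach (var pattern in patterns)
        {
            if (pattern.IsMatch(input))
                yield return pattern;
        }
    }

    static void Main()
    {
        var patterns = new List<Regex> { new Regex(@"^\d+$"), new Regex(@"^[a-z]+$") };
        Regex first = Matching(patterns, "42").FirstOrDefault();
        Console.WriteLine(first != null ? first.ToString() : "no match"); // prints ^\d+$
    }
}
```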
+1  A: 

Find the biggest chunk of constant text in each regex (if it is above a certain threshold length) and use the Karp-Rabin algorithm to search for all of those strings simultaneously. For each hit, run the corresponding regex to see if the whole thing matches. For each regex not included in the multi-string search, match that regex directly.

This should give you good performance for a large number of regular expressions if they have reasonable-length constant substrings, assuming you have preprocessing time available for the regular expressions.
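A sketch of the filtering idea, with the required literal for each pattern supplied by hand (deriving it automatically would require parsing the pattern) and a plain IndexOf standing in for a real Karp-Rabin multi-string search; all names here are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class LiteralPrefilter
{
    // Each entry pairs a regex with a substring that every match must contain.
    static int IndexOfFirstMatch(
        IList<(Regex Pattern, string Literal)> entries, string input)
    {
        for (int i = 0; i < entries.Count; i++)
        {
            // Cheap substring scan first; a real implementation would run a
            // single Karp-Rabin pass over the input for all literals at once.
            if (entries[i].Literal.Length > 0 &&
                input.IndexOf(entries[i].Literal, StringComparison.Ordinal) < 0)
                continue;

            // Only run the (expensive) regex on surviving candidates.
            if (entries[i].Pattern.IsMatch(input))
                return i;
        }
        return -1;
    }

    static void Main()
    {
        var entries = new List<(Regex, string)>
        {
            (new Regex(@"ERROR \d+"), "ERROR"),
            (new Regex(@"WARN(ING)? \d+"), "WARN"),
        };
        Console.WriteLine(IndexOfFirstMatch(entries, "WARN 17 in module x")); // prints 1
    }
}
```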

Charles
A: 

A quick web search shows that there exists a lex-like tool for .NET named C#Lex. But since I do not use .NET or C#, I cannot say whether it is any good or whether it is useful for you.

For Java, I found JLex and JFlex, both of which generate source code. Using these seems reasonable only if the regular expressions are literally compiled "off-line" (outside the application) and the generated code is then incorporated into your application's classpath. The .NET tool probably behaves similarly.

Christian Semrau