views: 1104
answers: 16

Let's say that I have 10,000 regexes and one string, and I want to find out if the string matches any of them and get all the matches. The trivial way to do it would be to query the string one by one against all the regexes. Is there a faster, more efficient way to do it?

EDIT: I have tried substituting it with DFAs (lex). The problem here is that it would only give you one single pattern. If I have the string "hello" and the patterns "[H|h]ello" and ".{0,20}ello", the DFA will only match one of them, but I want both of them to hit.

A: 

Rather strange, the reverse is much more common and easier... I fear Jason is right; I see no way other than iteration.

PhiLho
A: 

Try combining them into one big regex?

Justice
That's certainly worth trying - hopefully, the regex compiler will optimize things a bit...
Mark Bessey
That's usually significantly slower.
Jimmy
This is actually a perfectly reasonable and fast solution if you're using a straightforward NFA-simulation based regex engine. Unfortunately, most regex implementations aren't, so this won't work in practice.
Eamon Nerbonne
+5  A: 

You'd need to have some way of determining whether a given regex was "additive" compared to another one, creating a regex "hierarchy" of sorts that lets you determine that all regexes of a certain branch cannot match and can be skipped.

Eric Wendelin
A: 

I'd almost suggest writing an "inside-out" regex engine - one where the 'target' was the regex, and the 'term' was the string.

However, it seems that your solution of trying each one iteratively is going to be far easier.

warren
+3  A: 

You could combine them in groups of maybe 20.

(?=(regex1)?)(?=(regex2)?)(?=(regex3)?)...(?=(regex20)?)

As long as each regex has zero (or at least the same number of) capture groups, you can look at what was captured to see which pattern(s) matched.

If regex1 matched, capture group 1 would have its matched text. If not, it would be undefined/None/null/...
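
For example, a minimal Python sketch of this lookahead trick (toy patterns; it assumes none of the combined regexes contain capture groups of their own):

import re

# Hypothetical toy patterns; each contributes exactly one capture group.
patterns = [r"[Hh]ello", r".{0,20}ello", r"\d{4}"]

# Wrap each pattern in an optional group inside a lookahead: the lookahead
# always succeeds, but the group only captures if that pattern matched here.
combined = re.compile("".join(f"(?=({p})?)" for p in patterns))

def matching_patterns(s):
    m = combined.match(s)  # zero-width assertions only, so this never fails
    return [patterns[i] for i in range(len(patterns)) if m.group(i + 1) is not None]

print(matching_patterns("hello"))  # ['[Hh]ello', '.{0,20}ello']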

MizardX
+5  A: 

10,000 regexen, eh? Eric Wendelin's suggestion of a hierarchy seems like a good idea. Have you thought of organizing this enormous pile of regexen into something like a tree structure?

As a simple example: all regexen requiring a number could branch off one regex that checks for a number, and all regexen not requiring one go down another branch. In this fashion you could reduce the number of actual comparisons to a path along the tree instead of doing every single one of the 10,000 comparisons.

This would require decomposing the regexen provided into genres, each genre having a shared test that rules out the whole genre if it fails. In this way you could theoretically reduce the number of actual comparisons dramatically.

If you had to do this at run time you could parse through your given regular expressions and "file" them into either predefined genres (easiest to do) or comparative genres generated at that moment (not as easy to do).

Your example of comparing "hello" against "[H|h]ello" and ".{0,20}ello" won't really be helped by this solution. A simple case where this could be useful: if you had 1,000 tests that only return true when "ello" exists somewhere in the string, and your test string is "goodbye", you would only have to run the one test for "ello" to know that the 1,000 tests requiring it can't match, and so you can skip them.
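
As an illustration only (not akdom's code), here is a minimal Python sketch of the "genre" idea, with the shared literal test for each genre supplied by hand; deriving those shared tests from arbitrary regexes automatically is the hard part:

import re

# Hypothetical genres: each bucket of regexes shares one cheap literal test.
GENRES = {
    "ello": [r"[Hh]ello", r".{0,20}ello"],
    "good": [r"good(bye|night)"],
}

def find_matches(s):
    hits = []
    for literal, patterns in GENRES.items():
        if literal not in s:          # one cheap test rules out the whole genre
            continue
        hits.extend(p for p in patterns if re.search(p, s))
    return hits

print(find_matches("goodbye"))  # only the "good" genre is tested at all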

akdom
Any libraries that can do this automatically? This can't be handled manually; the regexes are not hard-coded.
Sridhar Iyer
I don't know of any libraries that do this, but you could write something to parse the regexen and "file" those you have under testable cases. Even if you can only do this very broadly, you could possibly cut the time to execute significantly by ruling out large swaths of what doesn't fit.
akdom
esmre [http://code.google.com/p/esmre/] is a Python/C library that does something similar automatically.
Will Harris
+1: Really like the hierarchy idea. When you're dealing with these relatively large numbers it is likely you can classify them in some way.
Marc
By building a "tree" structure, you're essentially one step towards implementing a DFA/NFA. You're better off using a real (extremely fast) DFA/NFA matcher rather than taking on the complexity of both worlds.
Eamon Nerbonne
Having had exactly this problem before: if you can't vouch for the "safety" of the regexes, it's quite possible that a particular regex will exhibit terribly poor performance if you're using PCRE; a union of 10,000 regexes is liable to be fragile (perf-wise) if they're implemented à la Perl (or .NET, etc.).
Eamon Nerbonne
A: 

I think the short answer is that yes, there is a way to do this, that it is well known to computer science, and that I can't remember what it is.

The practical answer is that you might find your regex interpreter already deals with all of these efficiently when they're |'d together, or you might find one that does. If not, it's time to Google string-matching and searching algorithms.

Marcin
+9  A: 

We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.

Advantages are it runs fast (each character is compared only once) and does not get any slower if you add more expressions.

Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).

The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you Google "regex" or "regular expression" together with "DFA", you'll find stuff that will point you in the right direction.
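
To illustrate the shape of the idea (purely a toy sketch in Python, not the code described above): a hand-built product DFA for the two patterns "ab*" and "a*b", where each state records which of the original patterns it accepts, so one left-to-right walk reports every match:

# Toy transition table for the combined DFA of r0 = "ab*" and r1 = "a*b",
# matched against the whole string. Missing entries lead to a dead state.
DEAD = "dead"
TRANSITIONS = {
    ("start", "a"): "s_a",  ("start", "b"): "s_b",
    ("s_a", "a"): "s_aa",   ("s_a", "b"): "s_ab",
    ("s_aa", "a"): "s_aa",  ("s_aa", "b"): "s_b",
    ("s_ab", "b"): "s_abb",
    ("s_abb", "b"): "s_abb",
}
# Which of the original patterns each state accepts.
ACCEPTS = {"s_a": {"ab*"}, "s_ab": {"ab*", "a*b"}, "s_b": {"a*b"}, "s_abb": {"ab*"}}

def matched_patterns(s):
    state = "start"
    for ch in s:  # each character is examined exactly once
        state = TRANSITIONS.get((state, ch), DEAD)
        if state == DEAD:
            return set()
    return ACCEPTS.get(state, set())

print(matched_patterns("ab"))   # {'ab*', 'a*b'}
print(matched_patterns("aab"))  # {'a*b'}
print(matched_patterns("abb"))  # {'ab*'}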

Tim Farley
An FSA is definitely the way to go, here!
Eamon Nerbonne
This was already implemented by us, but unfortunately the disadvantages you mention are why we are looking for a different solution.
Sridhar Iyer
+8  A: 

This is the way lexers work.

The regular expressions are converted into a single nondeterministic finite automaton (NFA) and possibly transformed into a deterministic finite automaton (DFA).

The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.

There are many tools that can help you here; they are called "lexer generators" and there are solutions that work with most languages.

You don't say which language you are using. For C programmers I would suggest having a look at the re2c tool. Of course, the traditional (f)lex is always an option.
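
Without generating a lexer, a rough Python approximation of the same single-pass idea is one alternation with named groups; note that it shares the limitation raised in the question's edit, since each position reports only one winning rule:

import re

# Hypothetical rules, each given a name so we can tell which one fired.
rules = {"hello_cap": r"[Hh]ello", "ello_any": r".{0,20}ello"}
scanner = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in rules.items()))

m = scanner.match("hello")
print(m.lastgroup)  # 'hello_cap' -- only the first matching alternative is reported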

Remo.D
Autogenerating a lex file would do the job - I hope you're a master of your make system if you go down this path ;-)
ijw
Great observation! This gets my vote. I'm sure there's an easier way to re-use the lexer code without resorting to generating source files.
Ramon
+2  A: 

If you're using real regular expressions (the ones that correspond to regular languages from formal language theory, and not some Perl-like non-regular thing), then you're in luck, because regular languages are closed under union. In most regex languages, pipe (|) is union. So you should be able to construct a string (representing the regular expression you want) as follows:

(r1)|(r2)|(r3)|...|(r10000)

where parentheses are for grouping, not matching. Anything that matches this regular expression matches at least one of your original regular expressions.
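
As a quick illustrative sketch in Python (toy patterns, not the questioner's real ones), building that union is just a join:

import re

patterns = [r"[Hh]ello", r".{0,20}ello", r"\d{4}"]
union = re.compile("|".join(f"({p})" for p in patterns))  # (r1)|(r2)|...|(rn)

# A hit means at least one of the original patterns matches; the capture groups
# hint at which branch fired, but only one branch is reported per match.
m = union.search("hello")
print(m is not None)  # True
print(m.lastindex)    # 1 -- the first branch, "[Hh]ello", matched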

Denis Bueno
I wish there actually were such regular expression implementations in real-world languages. Unfortunately, Perl screwed up and everyone else copied it... so this won't work well in almost any regex engine.
Eamon Nerbonne
+2  A: 

I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG). It's a higher-level abstraction of pattern matching; one feature is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into bytecode and running it in a specialized VM.

Disclaimer: the only one I know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the basic concepts.

Javier
PEGs are far more powerful (and slower) than required. The union of a set of regular languages is regular; i.e., a plain old NFA suffices.
Eamon Nerbonne
A: 

You could compile the regexes into a hybrid DFA/Büchi automaton where, each time the automaton enters an accept state, you flag which regex rule "hit".

Büchi is a bit of overkill for this, but modifying the way your DFA works could do the trick.

ceretullis
+6  A: 

I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.

I was lucky in that my regular expressions usually had some substring that must appear in every string they match. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithm. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.

I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.
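
The general shape of that pre-filter, sketched in Python (this is not esmre's code; it assumes the third-party pyahocorasick package, and the required substrings are supplied by hand here rather than extracted from the regexes):

import re
import ahocorasick  # pip install pyahocorasick

# Hypothetical rules, each paired with a literal substring it is known to require.
RULES = [(r"[Hh]ello", "ello"), (r".{0,20}ello", "ello"), (r"good(bye|night)", "good")]

buckets = {}
for i, (pattern, literal) in enumerate(RULES):
    buckets.setdefault(literal, []).append(i)

index = ahocorasick.Automaton()
for literal, rule_ids in buckets.items():
    index.add_word(literal, rule_ids)
index.make_automaton()

def find_matches(s):
    # The Aho-Corasick scan finds which required substrings occur, so only the
    # regexes behind those substrings are actually run.
    candidates = sorted({i for _, rule_ids in index.iter(s) for i in rule_ids})
    return [RULES[i][0] for i in candidates if re.search(RULES[i][0], s)]

print(find_matches("hello"))  # the "good" rule is never even tried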

Will Harris
+6  A: 

If you're thinking in terms of "10,000 regexes", you need to shift your thought processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like something has gone off the rails much earlier in the process than the choice of machine, since 10,000 target strings sounds a lot more like a database lookup than a string match.

+4  A: 

Martin Sulzmann has done quite a bit of work in this field. He has a HackageDB project, explained briefly here, which uses partial derivatives and seems to be tailor-made for this.

The language used is Haskell, and thus it will be very hard to translate to a non-functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).

The code is not based on converting to a series of automata and then combining them, instead it is based on symbolic manipulation of the regexes themselves.

Also, the code is very much experimental and Martin is no longer a professor but is in 'gainful employment'(1), so he may be uninterested in or unable to supply any help or input.


  1. This is a joke - I like professors; the less the smart ones try to work, the more chance I have of getting paid!
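
To give a feel for the "symbolic manipulation" approach, here is a simplified Brzozowski-derivative sketch in Python (not Sulzmann's partial-derivative Haskell code; only literals, concatenation, alternation and star are handled):

from dataclasses import dataclass

class Empty: pass                      # matches nothing
class Eps: pass                        # matches only the empty string

@dataclass
class Lit:
    ch: str

@dataclass
class Cat:
    left: object
    right: object

@dataclass
class Alt:
    left: object
    right: object

@dataclass
class Star:
    inner: object

def nullable(r):
    # Can r match the empty string?
    if isinstance(r, (Eps, Star)): return True
    if isinstance(r, (Empty, Lit)): return False
    if isinstance(r, Cat): return nullable(r.left) and nullable(r.right)
    return nullable(r.left) or nullable(r.right)   # Alt

def deriv(r, c):
    # The derivative of r with respect to c: what remains to be matched after c.
    if isinstance(r, (Empty, Eps)): return Empty()
    if isinstance(r, Lit): return Eps() if r.ch == c else Empty()
    if isinstance(r, Alt): return Alt(deriv(r.left, c), deriv(r.right, c))
    if isinstance(r, Star): return Cat(deriv(r.inner, c), Star(r.inner))
    d = Cat(deriv(r.left, c), r.right)             # Cat
    return Alt(d, deriv(r.right, c)) if nullable(r.left) else d

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

# Two toy patterns, "ab*" and "a*b", advanced in lockstep over the same input.
ab_star = Cat(Lit("a"), Star(Lit("b")))
a_star_b = Cat(Star(Lit("a")), Lit("b"))
print([matches(p, "ab") for p in (ab_star, a_star_b)])  # [True, True]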
ShuggyCoUk
I'm accepting this answer because I've gone through all the other routes and failed (yes, I have really implemented all the other solutions and found them falling short in many areas). I've taken up the project of porting the library from Haskell to C++; I might open-source it later. This might not really work out in the end, but it does seem promising and theoretically sound.
Sridhar Iyer
Good luck on the port. I could see it being very popular! Do let us know if it works.
ShuggyCoUk
@Sridhar Iyer: Any thoughts on progress? I'm faced with a similar situation.
Toad
A: 

The fastest way to do it seems to be something like this (code is C#):

public static List<Regex> FindAllMatches(string s, List<Regex> regexes)
{
    // Simply test the string against every regex and collect the ones that match.
    List<Regex> matches = new List<Regex>();
    foreach (Regex r in regexes)
    {
        if (r.IsMatch(s))
        {
            matches.Add(r);
        }
    }
    return matches;
}

Oh, you meant the fastest code? I don't know then...

RCIX
Premature optimisation blah blah... See how much too slow your easiest solution is, especially since the code to implement it is trivial.
ijw
What then would you suggest for being more efficient?
RCIX
If you are testing 10,000 regexes, it'll be very slow. You need some way to combine them to get a single-pass parser.
Sridhar Iyer