tags:

views:

100

answers:

3

My task is to determine if a string should cause a command to be raised in my C# command processor, this being configurable at runtime.

I have two collections of regexes. For the first collection 'EnabledPatterns' the command is enabled if any one of the patterns matches the command string. For the second 'DisabledPatterns' the command is disabled if any of the patterns match the command string. It is possible that there is overlap, where a pattern in EnabledPattern matches the same string as a pattern in DisabledPattern. For example:

EnablePatterns contains ^ABC.$ DisabledPatterns contains ^A.*$

Given the above example I'd like to remove the EnabledPatterns entry, thus diabling command processing for commands matching the 'EnabledPatterns' regex.

Is there a way to do this as a preprocess, each time any of the collections is changed so that I can build a list of active, enabled patterns without having the command string to hand. I essentially want to determine if there are any matches possible when the overlaps are removed.

A: 

Why would you need both a whitelist and a blacklist? I think you are struggling with a logical problem here, write it all out and have a think over what you intend.

Gregory
Because the user may want to disable a subset of enabled commands during a night shift for example.
Dave Arkley
You're overthinking. Run commands enabled. Get Result. Run commands disabled against the first result. Remeber the poor bastard who has to maintain your solution in the future!
Gregory
+2  A: 

I would say your best bet is to just check the disabled patterns first, and then only check the enable patterns if the disabled patterns all returned false.

Edit:

Ok... I just worked through this with some one in the office...

It turns out there is a N^4 algorithm that you can use to do this.

Basically, you need to:

  1. Convert both regexes into a NFAs
  2. Convert the NFAs into DFAs
  3. Compare the DFAs to see if one is a subset of the other

To compare the DFAs you need to do a parallel depth first search of both graphs, starting at the start node, and following the inputs. Each time you follow an input in each graph you build a composite node (n1, n2) with n1 being the node from the first graph and n2 being the node from the second graph.

If either node is an accepting state, mark it with a *, so that you would have (n1*,n2) (n1, n2*), (n1*,n2*), or (n1,n2).

Once you are done, look at all the edges where n1 is an accepting state (n1*, x). if X is also an accepting state, then regex1 is a subset of regex2.

This algorithm should be at worst case O(n^4). It might be better, but it won't be worse than that.

I'll see about posting some code in a bit.

Obviously, however, you would only want to do this if the volume of changing regexes was significantly lower than the volume of incoming things you were checking. Otherwise the expense in canceling out the regexes would cancel out any perf benefits from having fewever regexes.

Edit 2:

My analysis of O(n^4) run time was based on the number of states in the larger DFA. This, however, neglected the fact that the number of states in the DFA, after conversion from an NFA, can end up being 2^M, where M is the number of states in the NFA.

This means that the actual complexity ends up being something like:

O(2^4M)

Which is non polynomial (which is bad).

Most of the time, the number of NFA states won't get anywhere near that high, so it shouldn't be too slow in practice.

However, I would highly recommend running anything like this as an offline process, not as an online one.

Edit 3:

It is possible to convert a regular expression directly into a DFA. However, I can't find the description of the algorithm online, every one just points to the implementation in the dragon book. Unfortunately, I don't have a copy of it right now (this is shameful, I know). So for now I'm just going to go ahead with the regex=>nfa=>dfa approach. After I find a library, get the book, and grab the algorithm out of it, I'll look into modifying the code to use the direct approach.

Scott Wisniewski
I was trying to reduced the lists before processing any commands - the rate of arrive commands can be prodigious and if I can return quickly because I've pre-calculated that nothing is enabled.
Dave Arkley
I can see that. I just posted an update with something that might work. If I have some time, I'll post some code.
Scott Wisniewski
Scott,Appreciated - I need to read up on NFAs/DFAs to understand your post.
Dave Arkley
I'm trying to get you a Code Sample, which might help. Right now I've written a RegEx parser (for a subset of the language), but I still need to write the DFA / NFA generation, and then the comparison logic. It will be a little while though, because I have other work to do. Once I get it done I'll post it for you. It won't completely solve your problem,because you'll need to extend it cover the full regex language, but it should be a good starting point. Also, you need to make sure that my interpretation of regexes matches the one your library uses. Are you ok with the MIT license?
Scott Wisniewski
Scott, I found this implementation of NFA/DFA conversion so I have a DFA for my regexes. MIT license is cool. http://www.dina.dk/~sestoft/gcsharp/GNfaToDfa.csAgain thank you for your efforts, this is my first post on SO and if you are typical of responders then it's a great resource.
Dave Arkley
I just wanted to give you a heads up on this. It's going to be a little while. I had to put this on the back burner because of some work I need to on my product. This is still on my radar though.
Scott Wisniewski
A: 

What you want to do boils down to whether the AND of two regular expressions is empty (i.e. doesn't match anything) or not. Theoretically, this is possible to compute. The AND of two regular languages is also regular. And the question of whether a regular language is empty is decidable.

However, this probably doesn't help you much because you probably don't want to be simulating finite automata in your program.

newacct