Can we compute a sort of distance between regular expressions?

The idea is to measure how similar two regular expressions are.

+3  A: 

There are a few metrics you could use:

  1. The length of a valid match. Some regexes match only a fixed length, some have an upper limit and some a lower limit. Compare how similar their lengths or possible lengths are.

  2. The characters that match. Any regex will have a set of characters a match can contain (maybe all characters). Compare the set of included characters.

  3. Use a large document and see how many matches each regex makes and how many of those are identical.
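The third suggestion can be sketched roughly as follows. This is only one possible reading of it: the function name `match_overlap` and the sample text are made up for illustration, and "identical" is assumed to mean that both regexes match exactly the same span of the document.

```python
import re

def match_overlap(pattern_a, pattern_b, text):
    """Jaccard overlap of the (start, end) spans each regex matches in text."""
    spans_a = {m.span() for m in re.finditer(pattern_a, text)}
    spans_b = {m.span() for m in re.finditer(pattern_b, text)}
    union = spans_a | spans_b
    if not union:
        # Neither regex matched anything; treated as "identical" here (an assumption).
        return 1.0
    return len(spans_a & spans_b) / len(union)

# Hypothetical sample document.
text = "order 66, room 101, gate 7b"
print(match_overlap(r"\d+", r"[0-9]+", text))   # same spans on this ASCII text
print(match_overlap(r"\d+", r"\d{2,}", text))   # the single digit "7" is only matched by the first
```

A value of 1.0 means both regexes found exactly the same matches in this document, and 0.0 means they shared none. The result depends entirely on the document you choose.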

Are you looking for strict equivalence?

David Kanarek
+1: I prefer this answer to the current top-voted because you have made a very pragmatic list of concrete suggestions that are easily implementable.
Mark Byers
+1  A: 

I think first you need to understand for yourself how you see a "difference" between two expressions. Basically, define a distance metric.

In the general case, this would be quite difficult to do. Depending on what you need, allowing one different character in some position may count as a big difference. In another case, allowing any number of consecutive identical characters may not make much difference.

I'd like to emphasize as well that distance functions are normally applied to, well, let's call them tokens. In our case, character sequences. What you want to do is apply this method not to those tokens but to the rules that a multitude of tokens will match. I'm not quite sure it even makes sense.

Still, I believe we could think of something, not in general but for one particular and quite restricted case. Do you have some sort of example to show us?

Developer Art
+4  A: 

You can build deterministic finite-state machines for both regular expressions and compare their transitions. The difference between the two sets of transitions can then be used to measure the distance between the regular expressions.
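As a rough illustration of working with the two machines, here is a sketch of the standard product-construction check for whether two DFAs accept the same language. Note that this only answers the yes/no question (distance zero or not); it is not the graded "difference of transitions" and is not claimed to be what the answer had in mind. The DFAs are written by hand, and the tuple layout `(start, accepting, transitions)` is an assumption of this sketch, since Python's `re` module does not expose an automaton.

```python
from collections import deque

def dfa_equivalent(dfa_a, dfa_b, alphabet):
    """Walk both DFAs in lockstep; they differ if some reachable pair of
    states disagrees on acceptance. Transitions must be total."""
    start_a, accept_a, trans_a = dfa_a
    start_b, accept_b, trans_b = dfa_b
    start = (start_a, start_b)
    seen = {start}
    queue = deque([start])
    while queue:
        qa, qb = queue.popleft()
        if (qa in accept_a) != (qb in accept_b):
            return False
        for sym in alphabet:
            nxt = (trans_a[(qa, sym)], trans_b[(qb, sym)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Hand-built DFAs over the alphabet {a, b}; "dead" is a rejecting sink state.
a_star = (0, {0}, {(0, "a"): 0, (0, "b"): "dead",
                   ("dead", "a"): "dead", ("dead", "b"): "dead"})
# Same language as a*, but tracking whether the count of a's is even or odd.
a_star_parity = ("even", {"even", "odd"}, {
    ("even", "a"): "odd", ("odd", "a"): "even",
    ("even", "b"): "dead", ("odd", "b"): "dead",
    ("dead", "a"): "dead", ("dead", "b"): "dead"})
a_plus = ("s", {"t"}, {("s", "a"): "t", ("t", "a"): "t",
                       ("s", "b"): "dead", ("t", "b"): "dead",
                       ("dead", "a"): "dead", ("dead", "b"): "dead"})

print(dfa_equivalent(a_star, a_star_parity, "ab"))
print(dfa_equivalent(a_star, a_plus, "ab"))
```

The second pair differs only on the empty string, which shows why a plain equivalence test says nothing about how far apart two expressions are.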

Gumbo
Perhaps go one step ahead, convert the state machine into a graph representation and look for isomorphism?
Noufal Ibrahim
How would you compare the two reasonably similar regular expressions '\w+\d+' and '[a-zA-Z]{1,63}[1-9][0-9]{,3}' using this method? How can you tell if two states in different FSMs are "equivalent" or "similar"?
Mark Byers
@Noufal Ibrahim: Yes, I actually meant something like that. There are also algorithms that can tell if two finite-state machines are equivalent.
Gumbo
@Mark Byers: The actual question is how to measure the similarity. How similar is `\w` to `[a-zA-Z]`, `+` to `{1,63}`, `\d` to `[1-9]`, and `*` to `{,3}`?
Gumbo
Yes, I understand that it is difficult to measure similarity, I just don't see how building deterministic state machines helps at all. What is 'the difference of both transitions'? How would you determine that two non-identical states in the middle of two different FSMs are equivalent enough that it makes sense to measure 'the distances of their transitions'? How would you define a mapping between the states of the FSMs? Could you please expand on your answer? Whilst the idea seems interesting, I don't understand how this could ever work in practice. Do you know of a real example of this?
Mark Byers
+2  A: 

If you have two regular expressions and have a set of example inputs you could try matching every input against each regex. For each input:

  • If they both match or both don't match, score 0.
  • If one matches and the other doesn't, score 1.

Sum this score over all inputs, and this will give you a 'distance' between the regular expressions. This will give you an idea of how often two regular expressions will differ for typical input. It will be very slow to calculate if your sample input set is large. It won't work at all if both regexes fail to match for almost all random strings and your expected input is entirely random. For example the regex 'sgjlkwren' and the regex 'ueuenwbkaalf' would probably both never match anything if tested on random input, so this metric would say the distance between them is zero. That might or might not be what you want (probably not).
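A minimal sketch of this scoring scheme, assuming that "match" means the regex is found anywhere in the input (`re.search`); use `fullmatch` instead if whole-string matching is what you need. The function name and the sample inputs are made up for illustration.

```python
import re

def disagreement_distance(pattern_a, pattern_b, samples):
    """Count the inputs on which exactly one of the two regexes matches."""
    rx_a = re.compile(pattern_a)
    rx_b = re.compile(pattern_b)
    return sum(
        (rx_a.search(s) is None) != (rx_b.search(s) is None)
        for s in samples
    )

# Hypothetical sample inputs.
samples = ["abc1", "abc", "123", "x9y", ""]
print(disagreement_distance(r"\w+\d+", r"[a-zA-Z]{1,63}[1-9][0-9]{,3}", samples))
```

On these five inputs the two regexes disagree only on "123", which the first matches and the second does not because it requires a leading letter.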

You might be able to analyze the structure of the regex and use biased random sampling to deliberately hit strings that match more frequently than in completely random input. For example, if both regex require that the string starts with 'foo', you could make sure that your test inputs also always start with foo, to avoid wasting time testing strings that you know will fail for both.

So in conclusion: unless you have a very specific situation with a restricted input set and/or restricted regular expression language, I'd say it's not possible. If you do have some restrictions on your input and on the regular expression, it might be possible. Please specify what these restrictions are and maybe I can come up with something better.

Mark Byers
+2  A: 

I suppose you could compute a Levenshtein Distance between the actual Regular Expression strings. That's certainly one way of measuring a "distance" between two different Regular Expression strings.

Of course, I think it's possible that regular expressions are not required here at all, and computing the Levenshtein Distance of the actual "value" strings that the Regular Expressions would otherwise be applied to, may yield a better result.
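For reference, a plain dynamic-programming sketch of Levenshtein distance; it works the same way whether you feed it the regex source strings or the value strings. The function name is just for illustration.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

print(levenshtein("a|b", "b|a"))        # 2
print(levenshtein("^[0-9]", "[^0-9]"))  # 2
```

Both pairs score 2, even though the first pair describes the same set of strings and the second pair does not, so this measures textual closeness of the patterns rather than closeness of what they match.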

CraigTP
Note that a distance measure for regular expressions is something entirely different than a distance measure for strings. E.g. `distance(regex("a|b"), regex("b|a"))` is by definition 0. And some changes are MUCH more significant than others. `abcde` may be similar to `bacde`, just two characters swapped, but `^[0-9]` is entirely unlike `[^0-9]`.
MSalters
A: 

There's an answer hidden in an earlier question here on SO: Generating strings from regexes. You can calculate an (asymmetric) distance measure by generating strings using one regex and checking how many of those match the other regex.

This can be optimized by stripping out shared prefixes/suffixes. E.g. a[0-9]* and a[0-7]* share the a prefix, so you can calculate the distance between [0-9]* and [0-7]* instead.
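A small sketch of the generate-and-check idea. Python's standard library has no regex-to-string generator, so this stand-in simply enumerates every string over a small alphabet up to a maximum length and keeps those the first regex fully matches. The alphabet, the length bound and the function names are all assumptions made for illustration.

```python
import re
from itertools import product

def generate_matches(pattern, alphabet, max_len):
    """Brute-force stand-in for a regex string generator."""
    rx = re.compile(pattern)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if rx.fullmatch(s):
                yield s

def asymmetric_distance(pattern_a, pattern_b, alphabet, max_len):
    """Fraction of strings generated from pattern_a that pattern_b rejects."""
    rx_b = re.compile(pattern_b)
    total = missed = 0
    for s in generate_matches(pattern_a, alphabet, max_len):
        total += 1
        if rx_b.fullmatch(s) is None:
            missed += 1
    return missed / total if total else 0.0

# Shared prefix "a" already stripped, as suggested above.
print(asymmetric_distance(r"[0-9]*", r"[0-7]*", "0789", 2))
print(asymmetric_distance(r"[0-7]*", r"[0-9]*", "0789", 2))
```

The two directions give different values, which is the asymmetry mentioned above: every string of `[0-7]*` is accepted by `[0-9]*`, but not the other way round.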

MSalters