Let's say we have two regular expressions:

1234.*

and

.*

Input:

1234567

Obviously they both match, but 1234.* is the better match because it is more specific, i.e. more relevant. Is there a standard way to check which one is more relevant?

edit:

Some clarification. I want to make decisions by checking which regexp matches the input best. In this case I am only matching numbers.

Example with telephone numbers:

Input:

31882481337

We have a rule for each of the following regexps:

31.*
.*

In this scenario I would like the rule bound to 31.* to be used, because it is more specific for the given input. If I were not using regexps this would be easy, because I could use a scoring mechanism to check how well each rule matches. However, these rules may contain more advanced regexps, like:

31[89].*
+1  A: 

One factor I can think of is whether a language is finite or infinite. A finite language is definitely more relevant than an infinite one, as there are only a finite number of acceptable words in the language.

When measuring infinite languages like the ones in your examples, both just go on forever: you can keep counting the words in each language until you're blue in the face and you'll never reach a conclusion.

That is, until you consider that the first regex's language is a proper subset of the second's language. Then you might say the first one is more relevant.

I'm not aware of any standard way to measure regex relevancy, though.

To expound on the idea of proper subsets, you may ask: what is your language, and does your regex accept words outside of it? Your expression might still work, but it accepts a wider range of words than you intended. Of course this may not matter if your input is controlled, but that's one way you could measure relevance: is it accepting my language exactly?

Yours is a good example. Perhaps you want to accept numbers starting with 1234. 1234.* works like a charm, but that isn't the language you specified. 1234\d* is more specific and matches your language exactly as you specified, thus being more relevant.

But this is all from a purely theoretical standpoint and probably won't help you much to programmatically determine whether one regex is better than another.
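That said, for a small alphabet and a bounded word length, the proper-subset idea can at least be approximated by brute force. The following is only a sketch in Python: the function names, the alphabet "1234a" and the length limit are assumptions made for illustration, and a result only holds for words up to that length, not for the full (infinite) languages.

```python
import re
from itertools import product


def bounded_words(alphabet, max_len):
    """Yield every word over `alphabet` with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def compare_languages(pattern_a, pattern_b, alphabet, max_len):
    """Compare the languages of two regexes on all words up to max_len.

    Returns "subset" if A accepts strictly fewer words than B,
    "superset" for the reverse, "equal" or "incomparable" otherwise.
    """
    a = re.compile(pattern_a)
    b = re.compile(pattern_b)
    only_a = only_b = False
    for word in bounded_words(alphabet, max_len):
        in_a = a.fullmatch(word) is not None
        in_b = b.fullmatch(word) is not None
        if in_a and not in_b:
            only_a = True
        elif in_b and not in_a:
            only_b = True
    if only_a and only_b:
        return "incomparable"
    if only_a:
        return "superset"
    if only_b:
        return "subset"
    return "equal"


# The alphabet includes one non-digit so that \d* and .* can differ.
print(compare_languages(r"1234\d*", r"1234.*", "1234a", 6))  # subset
print(compare_languages(r"1234.*", r".*", "1234a", 6))       # subset
```

An exact answer would need the regexes converted to automata and compared there, which also only works for truly regular patterns (no backreferences or lookarounds).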

Victor
+3  A: 

I think there is no easy way to do this. If you look at more complex examples, you will soon realize that it is quite hard to define "more relevant" exactly at all. Things like assertions and backreferences come into play.

I can think of two ways to roughly estimate the "relevance".

  1. Randomly modify the input and compare how many modifications cause each expression to fail.

  2. Analyze the expressions themselves. Count and compare the number of terminal symbols vs. wildcards, the number of assertions, and whatever else you like.
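The second idea could be sketched roughly as follows in Python. The scanner below is a deliberately crude, hand-rolled assumption (the names `Score`, `crude_score` and `rank_key` are made up for this example); it does not use a real regex parser and ignores most of the syntax, so treat the numbers as a rough heuristic only.

```python
from typing import NamedTuple


class Score(NamedTuple):
    literals: int   # plain characters that must appear verbatim
    classes: int    # [...] sets and escapes such as \d
    wildcards: int  # occurrences of '.'


def crude_score(pattern: str) -> Score:
    """Count literals, character classes and wildcards in a pattern."""
    literals = classes = wildcards = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # \d, \w etc. count as classes (crudely, so do \b and \1);
            # escaped punctuation such as \. counts as a literal.
            if pattern[i + 1].isalnum():
                classes += 1
            else:
                literals += 1
            i += 2
        elif ch == "[":
            end = pattern.find("]", i + 1)
            classes += 1
            i = end + 1 if end != -1 else len(pattern)
        elif ch == "{":
            # Skip counted quantifiers such as {2,5}.
            end = pattern.find("}", i + 1)
            i = end + 1 if end != -1 else len(pattern)
        elif ch == ".":
            wildcards += 1
            i += 1
        elif ch in "*+?|()^$":
            i += 1  # quantifiers, grouping and anchors are ignored
        else:
            literals += 1
            i += 1
    return Score(literals, classes, wildcards)


def rank_key(pattern: str):
    """More literals first, then more classes, then fewer wildcards."""
    s = crude_score(pattern)
    return (s.literals, s.classes, -s.wildcards)


print(crude_score("31[89].*"))                           # Score(literals=2, classes=1, wildcards=1)
print(max(["31.*", ".*", "31[89].*"], key=rank_key))     # 31[89].*
print(crude_score("h.*|verylongtext|anotherverylongtext").literals)  # 32
print(crude_score("hell.*|v.*").literals)                # 5
```

Note that this scoring never looks at the input, only at the pattern text.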

Especially with the second solution, you have to be aware that many alternatives that are not used by the actual match might render the result irrelevant.

h.*|verylongtext|anotherverylongtext

hell.*|v.*

When matching "hello", the second expression is "more relevant", but the first contains many more terminal symbols and might get a much better ranking from the second solution. For matching "verylongtext", however, the first is "more relevant". This shows that "relevance" depends heavily on the actual input, and you would have to analyze the actual matching path, something that is implicitly done by the first solution. But randomly modifying the input is quite a hard task because the space of possible inputs is quite large, so I think this will not work very well either.
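One way to keep the first solution tractable is to replace random modification with every possible single-character substitution, which is a small and deterministic set. This is a swapped-in variant rather than true random sampling, and the Python sketch below is only an illustration: the function name `mutation_failures` and the default digit alphabet are assumptions.

```python
import re


def mutation_failures(pattern, text, alphabet="0123456789"):
    """Count single-character substitutions of `text` that break the match.

    A higher count suggests the pattern depends on more of this input.
    """
    compiled = re.compile(pattern)
    if compiled.fullmatch(text) is None:
        raise ValueError("pattern does not match the original input")
    failures = 0
    for i, original in enumerate(text):
        for replacement in alphabet:
            if replacement == original:
                continue
            mutated = text[:i] + replacement + text[i + 1:]
            if compiled.fullmatch(mutated) is None:
                failures += 1
    return failures


number = "31882481337"
print(mutation_failures(r".*", number))        # 0
print(mutation_failures(r"31.*", number))      # 18
print(mutation_failures(r"31[89].*", number))  # 26

letters = "abcdefghijklmnopqrstuvwxyz"
print(mutation_failures(r"h.*|verylongtext|anotherverylongtext", "hello", letters))  # 25
print(mutation_failures(r"hell.*|v.*", "hello", letters))                            # 99
```

Because the count is taken against the actual input, it ranks hell.*|v.* above the first expression for "hello", in line with the argument above. It only probes inputs one substitution away, so it says nothing about insertions, deletions or larger changes.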

Daniel Brückner
A: 

I don't know whether "relevancy" is the real issue. Each is relevant, and each will match "1234567", as you suggest. As you also say, however, one ("1234.*") is more specific. With regular expressions, specificity is great (in a simple case like this), and sometimes you can home in on it so far that you realize you didn't need a regex after all. Rule #1 of regular expressions: don't use them if you don't have to. For example, to match "1234567", I'd go with:

$source = '1234567';
if ( stripos( $source, '1234' ) === 0 ) {
  $foo = substr( $source, 4 );
  // $source began with '1234' and $foo holds the rest
} else {
  // it didn't begin with '1234'
}

That's a PHP example, but the idea is that, since you've honed your accepted value so tightly, you don't even need PCRE anymore. "Relevancy" won't really tell you much about a regular expression (how would you define "relevancy" in this context?). However, I think specificity is a more objective measurement, and being able to use non-regex string functions instead is about as measurably specific as it gets (in fact, it's boolean: is there a regular expression or not?).

Beyond taking the regex out of the equation entirely: to measure the specificity of a given regular expression, simply compare (heuristically, if necessary) how many different values would satisfy each expression. The expression with the lowest count in this test is the most specific.
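As a rough illustration of that counting heuristic, here is a Python sketch that counts how many digit strings of one fixed length each expression accepts and picks the lowest count among the expressions that match the input. The names `count_matches` and `most_specific`, the digit-only alphabet and the probe length of 5 are all assumptions for the example; counting at the real input length (11 digits here) would be far too slow to enumerate.

```python
import re
from itertools import product


def count_matches(pattern, length, alphabet="0123456789"):
    """Count the strings of exactly `length` characters that fully match."""
    compiled = re.compile(pattern)
    return sum(
        1
        for chars in product(alphabet, repeat=length)
        if compiled.fullmatch("".join(chars)) is not None
    )


def most_specific(patterns, text, length=5):
    """Among patterns matching `text`, return the one accepting the fewest
    strings of the probe length, or None if nothing matches."""
    candidates = [p for p in patterns if re.fullmatch(p, text) is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda p: count_matches(p, length))


print(count_matches(r".*", 5))        # 100000
print(count_matches(r"31.*", 5))      # 1000
print(count_matches(r"31[89].*", 5))  # 200
print(most_specific([r"31.*", r".*", r"31[89].*"], "31882481337"))  # 31[89].*
```

The probe length matters: a pattern that requires more characters than the probe length scores 0 and would look maximally specific, so the length should be at least as long as the longest fixed prefix in the rules.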

Chris