ansaurus

Question

Answer 1

+1 A:

One factor I can think of is whether a language is finite or infinite. A finite language is definitely more relevant than an infinite one, since it accepts only a finite number of words.

If you are measuring infinite languages like your examples, both go on forever: you can keep counting the words in each language until you're blue in the face and never reach a conclusion.

That is, until you consider that the first regex's language is a proper subset of the second's language. Then you might say one is more relevant.

I'm not aware of any standard way to measure regex relevancy, though.

To expound on the idea of proper subsets, you may ask: what is your language, and does your regex accept words outside of it? Your expression might still work, but it accepts a wider range of words than you intended. Of course, this may not matter if your input is controlled, but that's one way you could measure relevance: is it accepting my language exactly?

Yours is a good example. Perhaps you want to accept numbers starting with 1234. 1234.* works like a charm, but that isn't the language you specified. 1234\d* is more specific and matches your language exactly as you specified, thus being more relevant.
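As a rough illustration of the subset idea, here is a minimal Python sketch that checks two patterns against a handful of sample strings. The sample list and the helper name `accepted` are made up for this example, and a sample-based check can only disprove a subset relation, never prove it.

```python
import re

def accepted(pattern, samples):
    """Return the set of samples that the pattern matches in full."""
    compiled = re.compile(pattern)
    return {s for s in samples if compiled.fullmatch(s)}

# Hypothetical sample inputs, chosen by hand for illustration.
samples = ["1234", "1234567", "1234abc", "1234 56", "123", "abc1234"]

loose = accepted(r"1234.*", samples)
strict = accepted(r"1234\d*", samples)

# On these samples, everything the strict pattern accepts is also
# accepted by the loose one.
is_subset = strict <= loose

# Words the loose pattern lets through that are outside the intended language.
extras = sorted(loose - strict)
print(is_subset, extras)
```

Anything that shows up in `extras` is a word outside "numbers starting with 1234" that the looser expression still accepts.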

But this is all from a purely theoretical standpoint and probably won't help you much to programmatically determine whether one regex is better than another.

Victor 2009-07-14 20:50:27

Answer 2

+3 A:

I think there is no easy way to do this. If you look at more complex examples, you will soon realize that it is quite hard to define "more relevant" exactly at all. Things like assertions and backreferences come into play.

I can think of two ways to roughly estimate the "relevance".

1. Randomly modify the input and compare how many modifications cause each expression to fail.
2. Analyze the expressions themselves. Count and compare the number of terminal symbols versus wildcards, the number of assertions, and whatever else you like.
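The first approach (randomly modifying the input) could look roughly like the Python sketch below. It is only one possible reading of the idea: the function name `mutation_failure_rate`, the single-character substitution, the alphabet of ASCII letters and digits, and the fixed seed are all assumptions made for the example. The input text is assumed to be non-empty.

```python
import random
import re
import string

def mutation_failure_rate(pattern, text, trials=1000, seed=0):
    """Fraction of random single-character substitutions after which
    the pattern no longer matches the whole text."""
    rng = random.Random(seed)
    compiled = re.compile(pattern)
    if not compiled.fullmatch(text):
        raise ValueError("pattern must match the unmodified text")
    alphabet = string.ascii_letters + string.digits
    failures = 0
    for _ in range(trials):
        pos = rng.randrange(len(text))
        # Always substitute a character that differs from the original.
        replacement = rng.choice([c for c in alphabet if c != text[pos]])
        mutated = text[:pos] + replacement + text[pos + 1:]
        if not compiled.fullmatch(mutated):
            failures += 1
    return failures / trials

loose_rate = mutation_failure_rate(r"1234.*", "1234567")
strict_rate = mutation_failure_rate(r"1234\d*", "1234567")
print(loose_rate, strict_rate)
```

A higher failure rate suggests the expression is tied more tightly to this particular input. Because both calls use the same seed, they see the same mutations, so the two rates are directly comparable.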

Especially with the second solution, you have to be aware that many alternatives that are not used by the actual match might render the result irrelevant.

h.*|verylongtext|anotherverylongtext

hell.*|v.*

When matching "hello", the second expression is "more relevant", but the first contains many more terminal symbols and might get a much better ranking from the second solution. For matching "verylongtext", however, the first is "more relevant". This shows that "relevance" depends heavily on the actual input, and you would have to analyze the actual matching path, something that the first solution does implicitly. But randomly modifying the input is quite a hard task, because the space of possible inputs is quite large. I think this will not work very well either.
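To make the pitfall of the second approach (counting terminal symbols versus wildcards) concrete, here is a deliberately crude Python sketch. The function name `literal_wildcard_counts` is made up, and the counter assumes patterns built only from plain characters, `.`, `*`, `|` and backslash escapes; it does not understand character classes, groups or assertions.

```python
def literal_wildcard_counts(pattern):
    """Crudely count literal characters and '.' wildcards in a pattern.

    Assumption: no character classes, groups or assertions occur.
    Escapes such as \\d are skipped and count as neither kind.
    """
    literals = 0
    wildcards = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 1  # skip the escaped character as well
        elif ch == ".":
            wildcards += 1
        elif ch.isalnum():
            literals += 1
        i += 1
    return literals, wildcards

first = literal_wildcard_counts("h.*|verylongtext|anotherverylongtext")
second = literal_wildcard_counts("hell.*|v.*")
print(first, second)  # (32, 1) (5, 2)
```

The first expression scores far more literals than the second, regardless of whether the input is "hello" or "verylongtext", which is exactly why a static count alone can be misleading.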

Daniel Brückner 2009-07-14 20:51:17

Answer 3

A:

I don't know whether "relevancy" is the real issue. Each is relevant, and each will match "1234567", as you suggest. As you also say, however, one ("1234.*") is more specific. With regular expressions, specificity is great (in a simple case like this), and sometimes you can narrow things down so far that you realize you didn't need a regex after all. Rule #1 of regular expressions: don't use them if you don't have to. For example, to match "1234567", I'd go with:

$source = '1234567';
// strpos() is case-sensitive; a result of exactly 0 means the prefix is at the start
if ( strpos( $source, '1234' ) === 0 ) {
  $foo = substr( $source, 4 );
  // $source began with '1234' and $foo holds the rest
} else {
  // it didn't begin with '1234'
}

That's a PHP example, but the idea is that, since you've narrowed your accepted value down so tightly, you don't even need PCRE anymore. "Relevancy" won't really tell you much about a regular expression (how would you define "relevancy" in this context?). However, I think specificity is a more objective measurement, and being able to use non-regex string functions instead would sure as heck be very measurably specific (in fact, it's boolean: is there a regular expression or not?).

Short of removing the regex from the equation entirely, you can measure the specificity of a given regular expression by comparing (heuristically, if necessary) how many different values would satisfy the expression. The expression with the lowest score in this test would prove the most specific.
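One way to approximate "how many different values would satisfy the expression" is to enumerate every string over a small alphabet up to a length bound and count the full matches. The Python sketch below does that; the alphabet "1234a", the bound of 6 and the function name `count_accepted` are arbitrary choices for illustration, and the counts only mean something relative to each other.

```python
import re
from itertools import product

def count_accepted(pattern, alphabet, max_len):
    """Count strings over `alphabet`, of length 0..max_len, that the
    pattern matches in full. Brute force, so keep the bounds small."""
    compiled = re.compile(pattern)
    count = 0
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            if compiled.fullmatch("".join(chars)):
                count += 1
    return count

alphabet = "1234a"  # four digits plus one non-digit, chosen for the example
loose = count_accepted(r"1234.*", alphabet, 6)
strict = count_accepted(r"1234\d*", alphabet, 6)
print(loose, strict)  # 31 21
```

The lower count belongs to the more specific expression. The cost grows exponentially with the length bound, so for realistic alphabets this would have to be replaced by sampling.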

Chris 2009-07-14 21:18:07


How to check a regexp's relevance?
