ansaurus

Question

Is there a compelling reason to use quantifiers in Perl regular expressions instead of just repeating the character?

Answer 1

+12 A:

They do the exact same thing, so as far as practicality goes, it's a matter of preference. Is there a tiny performance difference one way or the other? Who knows, but it's surely insignificant.

The quantifiers are more useful (and required) when the pattern length isn't fixed, for example \d{12,16}, \d{2,}, etc.

I prefer \d{4} which is easier for my brain to parse than \d\d\d\d

Also what if you're matching a character class rather than a simple digit? [aeiouy0-9]{4} or [aeiouy0-9][aeiouy0-9][aeiouy0-9][aeiouy0-9] ?
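A minimal sketch of the points above, assuming made-up test strings (none of these inputs come from the question). It checks that the fixed-count quantifier and the spelled-out repetition accept the same strings, and shows the two range forms that plain repetition cannot express.

```perl
use strict;
use warnings;

# The fixed-count quantifier and plain repetition accept the same strings.
my $quantified = qr/^[aeiouy0-9]{4}\z/;
my $repeated   = qr/^[aeiouy0-9][aeiouy0-9][aeiouy0-9][aeiouy0-9]\z/;

for my $s (qw(a1e2 aeio 12345 abcd)) {
    my $q = $s =~ $quantified ? 1 : 0;
    my $r = $s =~ $repeated   ? 1 : 0;
    print "$s: quantified=$q repeated=$r\n";
}

# Ranges have no fixed-repetition equivalent.
my $range = qr/^\d{12,16}\z/;   # 12 to 16 digits
my $open  = qr/^\d{2,}\z/;      # 2 or more digits

print "range ok\n" if '1234567890123' =~ $range;   # 13 digits
print "open ok\n"  if '123'           =~ $open;
```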

Rob 2010-03-30 18:28:11

Your [aeiouy0-9] argument is a good reason to learn to do it "properly"

justintime 2010-03-30 18:39:04

OK, now what the heck would be a practical application of "[aeiouy0-9]" character class??? :)

DVK 2010-03-30 19:01:23

I think so too. Great point.

Morinar 2010-03-30 19:01:26

@DVK - Canadian postal code maybe? Those wacky canucks and their alpha-numeric postal codes :P

Rob 2010-03-30 19:33:39

That character class wouldn't be good for a Canadian postal code. The numbers and letters show up in different positions.

brian d foy 2010-03-30 19:39:31

@DVK: That's not the main point :) (Maybe `[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]` is a better demo.)

KennyTM 2010-03-30 20:10:16

I know, I just wanted to know if that class came off the ceiling or from a real example :)

DVK 2010-03-30 20:43:02

Answer 2

+5 A:

Any repetition of more than 3 or 4 will be hard to count at a glance. I consider this a compelling reason. On top of that, using the quantifier is a "denser" way to express the repeated information. To me, it's like the difference between copy-and-paste code "reuse" versus writing truly reusable code.

Matt Ball 2010-03-30 18:29:26

Answer 3

+3 A:

It's best to keep in mind that when he wants to match a run of 10 or more letters, he will have to use a quantifier rather than repetition, so it's better to get used to the right way now. Besides, if he insists on using repetition for longer runs, someone will have trouble counting them, which a quantifier would make unnecessary.

MarceloRamires 2010-03-30 18:30:25

Answer 4

+15 A:

There's no such thing as absolute readability. There's what people can individually recognize, which is why people often understand their code while nobody else can. If he never uses quantifiers, he's always going to think quantifiers are hard to read because he never learns to grok them.

I most often find that people say "more readable" when they really mean "that's what I know already" or "that's what I wrote the first time". That's not necessarily the case here, though.

An absolute quantifier like {4} is just easier to specify and communicate to other programmers. Who wants to count the number of \ds by hand? You write code for other people to read, so don't make their life harder.

However, you might have missed the bug in that code because you were focused on the quantifier issue. The $ anchor allows a newline at the end of the string, and if a Perl Best Practices zealot comes along and blindly adds /xsm to all regexes (a painful experience I've seen more than a few times), that $ allows even more invalid input. You probably want the \z absolute end-of-string anchor instead.
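A small sketch of the anchor difference, using made-up inputs rather than the strings from the question. It compares $ and \z on a string with a trailing newline, then shows what happens once /m is in effect.

```perl
use strict;
use warnings;

my $with_newline = "1234\n";

# $ matches at the very end of the string or just before a trailing newline.
my $dollar_matches = $with_newline =~ /^\d{4}$/  ? 1 : 0;
# \z matches only at the very end of the string.
my $z_matches      = $with_newline =~ /^\d{4}\z/ ? 1 : 0;

# With /m, ^ and $ match at every line boundary, so other lines can
# surround the four digits. \A and \z are not affected by /m.
my $multi    = "junk\n1234\nmore junk";
my $m_dollar = $multi =~ /^\d{4}$/m   ? 1 : 0;
my $m_z      = $multi =~ /\A\d{4}\z/m ? 1 : 0;

print "dollar=$dollar_matches z=$z_matches m_dollar=$m_dollar m_z=$m_z\n";
# prints: dollar=1 z=0 m_dollar=1 m_z=0
```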

Not that it happened in your case, but code reviews tend to turn into style or syntax reviews (because those are easier to notice) and actually miss the point of checking for proper and intended behavior and correct design. Often the style problems aren't worth worrying about considering all of the other ways you could spend time to improve code. :)

brian d foy 2010-03-30 18:37:45

Good point on the anchor stuff although that wasn't at all applicable in this case (These strings are pre-cleansed to not have newlines), that is information I didn't realize and will definitely be filing away in my toolbox.

Morinar 2010-03-30 19:03:21

"These strings are pre-cleansed". Heh, heard that one before. :)

brian d foy 2010-03-30 19:10:32

:-) You and me both.

Morinar 2010-03-30 19:12:34

Answer 5

+2 A:

\d{4} is easier to maintain than \d\d\d\d because it scales better. For example, if you later need to change it to match 11 digits, you could simply change the 4 to an 11, instead of having to add 14 characters to your regex.
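Taking the maintenance argument one step further, the count can live in a variable, which repetition cannot do at all. This is only a sketch; $width and the test strings are hypothetical names and values, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical field width; changing the requirement means changing one number.
my $width = 11;

# The variable is interpolated into the quantifier when the regex is compiled.
my $exact = qr/^\d{$width}\z/;

print "11 digits: ", ('12345678901' =~ $exact ? "match" : "no match"), "\n";
print "4 digits: ",  ('1234'        =~ $exact ? "match" : "no match"), "\n";
```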

toolic 2010-03-30 20:06:35

Answer 6

A:

About readability: some Perl programmers use very rare features, hoping they will be readable, but reading them requires understanding those rare features.

There are many regexp newbies who do not understand what {4} is.

About benefits, the second one may be better because it takes fewer array elements in the regexp engine. Unless you are a Real Programmer, you won't be optimizing performance down to nanoseconds.

SHiNKiROU 2010-03-30 23:11:03

Answer 7

+8 A:

I'm just going to sidestep the issue of readability for now.

First let's look at what each version compiles down to.

perl -Mre=debug -e'/^\d{4}$/'

Compiling REx "^\d{4}$"
synthetic stclass "ANYOF[0-9][{unicode_all}]".
Final program:
   1: BOL (2)
   2: CURLY {4,4} (5)
   4:   DIGIT (0)
   5: EOL (6)
   6: END (0)
anchored ""$ at 4 stclass ANYOF[0-9][{unicode_all}] anchored(BOL) minlen 4 
Freeing REx: "^\d{4}$"

perl -Mre=debug -e'/^\d\d\d\d$/'

Compiling REx "^\d\d\d\d$"
Final program:
   1: BOL (2)
   2: DIGIT (3)
   3: DIGIT (4)
   4: DIGIT (5)
   5: DIGIT (6)
   6: EOL (7)
   7: END (0)
anchored ""$ at 4 stclass DIGIT anchored(BOL) minlen 4 
Freeing REx: "^\d\d\d\d$"

Now I'm going to see how well each version performs.

#! /usr/bin/env perl
use Benchmark qw':all';

cmpthese( -10, {
  'loop' => sub{ 1234 =~ /^\d{4}$/ },
  'repeat' => sub{ 1234 =~ /^\d\d\d\d$/ }
});

           Rate   loop repeat
loop   890004/s     --   -10%
repeat 983825/s    11%     --

While /^\d\d\d\d$/ does consistently run faster (about 10% in this run), the difference isn't significant. That really just leaves readability.

Let's take this example to the extreme:

/^\d{32}$/;
/^\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d$/;

I don't think there are many people who would argue that the second example is easier to read.

If we take it to the other extreme, the first style seems downright redundant.

/^\d{1}$/;
/^\d$/;

So what it really comes down to is how many repetitions of \d it takes before your preference switches from repeating the \d to using a quantifier.

Brad Gilbert 2010-03-31 00:52:10

Sorry, it doesn't seem from your benchmark result that `/^\d{4}$/` is faster. `'repeat' => sub{ 1234 =~ /^\d\d\d\d$/ }` does more iterations per second, so I think it is the faster one.

Hynek -Pichi- Vychodil 2010-03-31 13:06:46

@Hynek fixed.

Brad Gilbert 2010-03-31 14:14:00

Answer 8

+1 A:

Like many things, it is a matter of how far you want to take it.

A real example.

Compare:

my @lines = $header =~ m/([^\n\r]{13}|[^\n\r]+)/g; #split header into groups of up to 13 characters

to

my @lines = $header =~ m/([^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r]|[^\n\r]+)/g; #split into groups of up to 13 characters

Can you still find the pipe '|'?
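For anyone who wants to see what the quantified version actually returns, here is a sketch run against a made-up 36-character header (the real header is not shown in this answer). It also tries [^\n\r]{1,13} as a possible shorter spelling, which gives the same chunks for this input.

```perl
use strict;
use warnings;

# Made-up header for the demo: "abc...xyz0123456789", 36 characters.
my $header = join '', 'a' .. 'z', 0 .. 9;

# Groups of exactly 13 characters, with a shorter leftover group at the end.
my @lines = $header =~ m/([^\n\r]{13}|[^\n\r]+)/g;

# Assumed alternative: a greedy {1,13} takes up to 13 characters per group.
my @lines_alt = $header =~ m/([^\n\r]{1,13})/g;

print "$_\n" for @lines;
# prints:
# abcdefghijklm
# nopqrstuvwxyz
# 0123456789
```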

Matthew S 2010-03-31 02:31:36

Answer 9

A:

I would be likely to use either form, depending on the circumstances.

Let's ignore the strawman complexity of custom character-classes repeated 96 times all on one line, and instead focus on nicely written code.

Consider:

$foo =~ m{
        (\d\d\d\d)
    [ ] (\d\d\d?)
    [ ] (\w\w)
}x;

I've used code like this to parse data from weather sensors. I use this format because it closely matches the manufacturer's documentation. This works pretty well for "fixed width" data formats that don't quite live up to the promise of fixed width fields (this is distressingly common in practice).

You can argue that I should put the spaces on separate lines or on the same line as the preceding field, rather than on the line with the subsequent field. But that is just formatting, and is truly a problem for perltidy.

In other cases, I have used code like this:

$foo =~ m{ 
        ( \d{4}   )
    [ ] ( \d{2,3} )
    [ ] ( \w{2}   )
}x;

To keep the above readable, you've got to add more whitespace, and play with formatting a bit more.

The second style scales with complexity better -- adding custom character classes and wide fields does not break readability.
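To show that the two layouts are interchangeable, here is a sketch that runs both against one record. The record string is invented for illustration and is not real sensor output.

```perl
use strict;
use warnings;

# Hypothetical record: a 4-digit field, a 2-3 digit field, a 2-character code.
my $record = "1013 72 NE";

# Repetition style: each field is spelled out character by character.
my @by_repetition = $record =~ m{
        (\d\d\d\d)
    [ ] (\d\d\d?)
    [ ] (\w\w)
}x;

# Quantifier style: each field states its width.
my @by_quantifier = $record =~ m{
        ( \d{4}   )
    [ ] ( \d{2,3} )
    [ ] ( \w{2}   )
}x;

print "repetition: @by_repetition\n";
print "quantifier: @by_quantifier\n";
```

Under /x the [ ] class is what keeps the literal space significant, which is why both versions use it instead of a bare space.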

The most important thing is to be consistent within a given regex. IOW, never do this:

$foo =~ m{ 
        ( \d\d\d\d )
    [ ] ( \d{2,3}  )
    [ ] ( \w\w     )
}x;

Ultimately, code performs two functions. The most well known function is that it tells the computer what to do. But the most important, yet largely overlooked function of code is to tell the maintenance programmer what the computer is doing.

daotoad 2010-03-31 08:38:39
