I have a file that contains different types of lines. I want to select only those lines that have a user-agent. I know that such a line looks something like this:

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de-DE; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16

So, I want to identify the line that starts with the string "User-Agent", and then process the rest of the line, excluding that string. My question is: does Perl store the remaining string in any special variable that I can use for further processing?

I search for that line with a simple regexp:

/^User-Agent:/
+3  A: 
if ($line =~ /^User-Agent: (.*)$/) {
    process_string($1);    # $1 holds everything after "User-Agent: "
}
eumiro
A: 

You can use $' to capture the post-match part of the string:

if ( $line =~ m/^User-Agent: / ) {
    warn $';
}

(Note that there's a trailing space after the colon there.)

But note, from perlre:

WARNING: Once Perl sees that you need one of $& , $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression (?: ... ) instead.) But if you never use $& , $` or $' , then patterns without capturing parentheses will not be penalized. So avoid $& , $' , and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.

martin clayton
[`perlvar`](http://p3rl.org/perlvar) about `$'`: "The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches." Capturing the parts you need seems like a much better idea for anything but one-liners.
rafl
+4  A: 

(my $remainder = $str) =~ s/^User-Agent: //;

M42
+3  A: 

The substr solution:

my $start = "User-Agent: ";

if ($start eq substr $line, 0, length($start)) {
    my $remainder = substr $line, length($start);
}
eugene y
I tend not to like this one because it's case sensitive and matches only one space. It's probably not that big of a deal, but HTTP doesn't constrain those things. Also, I tend to use index() to check if a substring is there since I don't have to care about the length.
brian d foy
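A minimal sketch of the index() check mentioned in the comment above, made case-insensitive and tolerant of any amount of whitespace after the colon. The helper name agent_value is hypothetical and does not appear anywhere in the thread; this is one possible reading of the suggestion, not necessarily how the commenter would write it.

```perl
use strict;
use warnings;

# agent_value() is a hypothetical helper name, not from the thread.
# Returns the header value, or nothing if the line is not a User-Agent header.
sub agent_value {
    my ($line) = @_;
    my $prefix = 'user-agent:';

    # index() returns 0 only when the prefix sits at the very start of the
    # line; lc() makes the comparison case-insensitive.
    return unless index(lc($line), $prefix) == 0;

    my $value = substr($line, length($prefix));
    $value =~ s/^\s+//;    # allow any amount of whitespace after the colon
    return $value;
}

print agent_value('user-agent:   Mozilla/5.0'), "\n";
```

The length of the prefix is still needed to cut the value out, but the test for the prefix itself no longer depends on it.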
+2  A: 

You could use the $' variable, but don't--that adds a lot of overhead. Probably just about as good--for the same purposes--is the @+ variable or, in English, @LAST_MATCH_END.

So this will get you there:

use English qw<@LAST_MATCH_END>;

my $value = substr( $line, $LAST_MATCH_END[0] );
Axeman
Why the magic variable at all when you can do it so cleanly with ()-grouping or with copying and substitution, like M42 showed?
Thomas Kappler
@Thomas: M42's solution is *destructive*. Also, one part of the question was: "does Perl store the remaining string in any special variable that I can use to process further?" Well, yes it does, but it's highly deprecated, but there is an effective work around using `substr` that is not as high-cost and equally not destructive. TIMTOWTDI, but destructive alterations are not as recommended for generic solutions.
Axeman
It's not destructive, because the substitution works on the new variable. Try it out: use 5.010;my $orig = 'User-Agent: Mozilla/5.0';(my $agent = $orig) =~ s/^User-Agent: //;say $orig;say $agent;
Thomas Kappler
@Thomas: Yeah, you're right. I had only glanced at it. It's a quibble, but I've always disliked the way that construction looks. The `substr` expression is cleaner. And the stuff about the "special variable" is still on point. eumiro's capture works about as well, doesn't copy a complete string just to chop the front part off, and looks cleaner. But it's about the same thing as mine--and I think that `@LAST_MATCH_START` and `@LAST_MATCH_END` deserve to be known.
Axeman
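For readers who have not met these arrays: a small sketch of how the punctuation forms @- (@LAST_MATCH_START) and @+ (@LAST_MATCH_END) can be used together with substr. The helper name split_at_match is made up for this illustration, and the /i flag and \s+ are assumptions borrowed from the other answers.

```perl
use strict;
use warnings;

# split_at_match() is a hypothetical helper name, not from the thread.
# Returns the matched prefix and the rest of the line, or an empty list.
sub split_at_match {
    my ($line) = @_;
    return unless $line =~ /^User-Agent:\s+/i;

    # $-[0] is the offset where the whole match starts,
    # $+[0] is the offset just past where it ends.
    my ($start, $end) = ($-[0], $+[0]);

    return (substr($line, $start, $end - $start), substr($line, $end));
}

my ($matched, $rest) = split_at_match('User-Agent: Mozilla/5.0');
printf "matched=%s rest=%s\n", $matched, $rest;
```

Both arrays are only meaningful after a successful match, so the offsets are read inside the same sub right after the match succeeds.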
+2  A: 

Perl 5.10 has a nice feature that allows you to get the simplicity of the $' solutions without the performance problems. You use the /p flag and the ${^POSTMATCH} variable:

 use 5.010;
 if( $string =~ m/^User-Agent:\s+/ip ) {
      my $agent = ${^POSTMATCH};
      say $agent;
      }

There are some other tricks though. If you can't use Perl 5.10 or later, you can use a global match in scalar context; the value of pos is then where the match left off in the string. You can use that position in substr:

 if( $string =~ m/^User-Agent:\s+/ig ) {
      my $agent = substr $string, pos( $string );
      print $agent, "\n";
      }

The pos is similar to the @+ trick that Axeman shows. I think I have some examples with @+ and @- in Mastering Perl in the first chapter.

With Perl 5.14, which is coming soon, there's another interesting way to do this. The /r flag on the s/// does a non-destructive substitution. That is, it matches the bound string but performs the substitution on a copy and returns the copy:

use 5.013;  # for now, but 5.014 when it's released
my $string = 'User-Agent: Firefox';
my $agent = $string =~ s/^User-Agent:\s+//r;
say $agent;

I thought that /r was silly at first, but I'm really starting to love it. So many things turn out to be really easy with it. This is similar to the idiom that M42 shows, but it's a bit tricky because the old idiom does an assignment then a substitution, where the /r feature does a substitution then an assignment. You have to be careful with your parentheses there to ensure the right order happens.

Note in this case that since the version is Perl 5.12 or later, you automatically get strictures.
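The difference in ordering between the old copy-then-substitute idiom and /r can be sketched side by side. The variable names here are only for illustration, and the third case shows what happens if you forget both the parentheses and the /r flag.

```perl
use 5.014;    # s///r needs Perl 5.14 or later; this also enables strict and say
use warnings;

my $orig = 'User-Agent: Firefox';

# Old idiom: the parentheses force the assignment first, then the
# substitution runs on the copy.
(my $copy_old = $orig) =~ s/^User-Agent:\s+//;

# With /r: the substitution runs first and returns the modified copy,
# which is then assigned. No parentheses needed.
my $copy_new = $orig =~ s/^User-Agent:\s+//r;

# Without parentheses and without /r, the substitution modifies $scratch
# itself and $count receives the number of substitutions made.
my $scratch = $orig;
my $count   = $scratch =~ s/^User-Agent:\s+//;

say $copy_old;    # Firefox
say $copy_new;    # Firefox
say $count;       # 1
say $orig;        # User-Agent: Firefox
```

In all three cases $orig is left untouched; only the third one changes the variable it is bound to.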

brian d foy
A: 

Use $' to get the part of the string to the right of the match.

There is much wailing and gnashing of teeth in the other answers about the "considerable performance penalty" but unless you actually know that your program is rich in use of regular expressions, and that you have a performance problem, I wouldn't worry about it.

We worry too often about optimizations that have little-to-no impact on the actual code. Chances are, this is one of them, too.

Andy Lester