Hi, if I have an input with new lines in it like:

[INFO]
xyz
[INFO]

How can I pull out the xyz part using $ anchors? I tried a pattern like /^\[INFO\]$(.*?)$\[INFO\]/ms, but perl gives me:

Use of uninitialized value $\ in regexp compilation at scripts\t.pl line 6.

Is there a way to shut off interpolation so the anchors work as expected?

EDIT: The key is that the end-of-line anchor is a dollar sign but at times it may be necessary to intersperse the end-of-line anchor through the pattern. If the pattern is interpolating then you might get problems such as uninitialized $\. For instance an acceptable solution here is /^\[INFO\]\s*^(.*?)\s*^\[INFO\]/ms but that does not solve the crux of the first problem. I've changed the anchors to be ^ so there is no interpolation going on, and with this input I'm free to do that. But what about when I really do want to reference EOL with $ in my pattern? How do I get the regex to compile?

+3  A: 

The question is academic--there's no need for the $ anchors in your regex anyway. You should be using \n to match the newlines, because the $ only matches the gap between the linefeed and the character before it.

EDIT: What I'm trying to say is that you will never need to use $ that way. Any match that spans from one line to the next will have to consume the line separator somehow. Consider your example:

/^\[INFO\]$(.*?)$\[INFO\]/ms

If this did compile, the (.*?) would start out by consuming the first linefeed and keep going until it had matched \nxyz, where the second $ would succeed. But the next character is a linefeed, and the regex is looking for [, so that doesn't work. After backtracking, the (.*?) would reluctantly consume one more character--the second linefeed--but then the $ would fail.

Any time you try to match an EOL with $ and then some more stuff, the first "stuff" you'll have to match will be the linefeed, so why not match that instead? That's why the Perl regex compiler tries to interpret $\ as a variable name in your regex: it makes no sense to have an end-of-line anchor followed by a character that's not a line separator.
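A small sketch of that argument, assuming the three-line sample input from the question. The single-quote delimiters are there only so the first pattern compiles without interpolation; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "[INFO]\nxyz\n[INFO]\n";

# Single-quote delimiters switch off interpolation, so $ stays an anchor.
# The pattern compiles, but it cannot match this input: wherever $ succeeds,
# the next character is a linefeed (or the end of the string), never the [
# that the pattern asks for.
my $anchors_only = $text =~ m'^\[INFO\]$(.*?)$\[INFO\]'ms ? 'match' : 'no match';

# Consuming the line separators explicitly does match.
my ($with_newlines) = $text =~ /^\[INFO\]\n(.*?)\n\[INFO\]/ms;

print "anchors only:  $anchors_only\n";
print "with newlines: $with_newlines\n";
```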

Alan Moore
Yes, the question is academic. I've edited the post to show that I'm only interested in figuring out how to get $ to function as end-of-line in multiple places within the regex.
harschware
@harschware: see my expanded answer.
Alan Moore
I see what you mean, but again: what about the crux of the problem? How about if the regex were /^\[INFO\]$\nxyz/ms, then $\ is interpolated to undef and the regex fails to match. The problem is not how do I get my pattern to match... but how do you use $ as EOL in cases where it is getting interpolated?
harschware
You *can* disable interpolation by using single-quotes as the regex delimiters, i.e., `m'^\[INFO\]$\nxyz'm`. But my point is that you don't have to. Perl is very clever about determining whether a sequence should be interpolated or not. Notice that it doesn't try to interpolate `$(`, which is also a built-in variable.
Alan Moore
Actually, your comment about using m'' to specify the regex is the answer I was looking for. I did not see this comment before starting the bounty - I'll accept this answer when the waiting period is up.
harschware
+3  A: 

Based on the answer in perlfaq6 ("How can I pull out lines between two patterns that are themselves on different lines?"), here's what a one-liner would look like:

perl -0777 -ne 'print $1,"\n" while /\[INFO\]\s*(.*?)\s*\[INFO\]/sg' file.txt

The -0777 switch slurps in the whole file at once.

However, if you're after a subroutine that gives you the flexibility to choose what tag you want to extract, the File::Slurp module makes things a little easier:

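use strict;
use warnings;
use File::Slurp qw/slurp/;

# Return the text between the first two occurrences of $tag in $fileName.
sub extract {

    my ( $tag, $fileName ) = @_;
    my $text = slurp $fileName;    # read the whole file into one string

    # /s lets . match newlines; only the first capture is wanted, so no /g
    my ($info) = $text =~ /$tag\s*(.*?)\s*$tag/s;
    return $info;
}

# Usage:
print extract( qr/\[INFO\]/, 'file.txt' ), "\n";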
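If pulling in a module is not an option, here is a minimal core-only sketch of the same idea. The helper name `extract_from_string` is hypothetical, and it assumes the text is already in a string and that the tag is passed as literal text rather than as a precompiled regex.

```perl
use strict;
use warnings;

# Hypothetical helper: works on a string instead of a file name, and takes
# the tag as literal text rather than a qr// object.
sub extract_from_string {
    my ( $tag, $text ) = @_;
    my $quoted = quotemeta $tag;    # so [ and ] are matched literally
    my ($info) = $text =~ /$quoted\s*(.*?)\s*$quoted/s;
    return $info;                   # undef when the tag pair is not found
}

print extract_from_string( '[INFO]', "[INFO]\nxyz\n[INFO]\n" ), "\n";
```

Using quotemeta means the caller does not have to remember to escape the brackets, at the cost of not being able to pass a real pattern as the tag.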
Zaid
+3  A: 

When regexes get too tricky, they probably are the wrong tool. I might consider using the flip flop operator here. It's false until its lefthand side is true, then stays true until its righthand side is true. That way, you can choose where to start and end the extraction just by looking at individual lines:

my $string = <<'HERE';
[INFO]
xyz
[INFO]
HERE

open my $string_fh, '<', \$string;

while( <$string_fh> )
    {
    next if /\[INFO]/ .. /\[INFO]/;
    chomp;

    print "Extracted <$_>\n";
    }
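One thing to watch: with the two-dot form and the same pattern on both sides, the range opens and closes on the same marker line, so the loop above just skips marker lines and prints everything else. If the input may contain lines outside the [INFO] blocks, the three-dot form is one way to keep only what sits between the markers. This is a sketch, not the code from the FAQ; the helper name `lines_between_markers` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines strictly between marker lines.
sub lines_between_markers {
    my ($text) = @_;
    my @found;
    for ( split /^/, $text ) {
        # Three dots: the right-hand side is not tested on the same line
        # that switched the range on.
        my $seq = /^\[INFO\]$/ ... /^\[INFO\]$/;
        next unless $seq;          # outside any block
        next if $seq == 1;         # the opening marker line
        next if $seq =~ /E0\z/;    # the closing marker line
        chomp( my $line = $_ );
        push @found, $line;
    }
    return @found;
}

print "Extracted <$_>\n"
    for lines_between_markers("before\n[INFO]\nxyz\n[INFO]\nafter\n");
```

The flip-flop returns a sequence number while it is true, with "E0" appended on the line that closes the range, which is what the two extra checks rely on.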

If you are using Perl 5.10 or later, you can use the generalized line ending \R in a regex:

use 5.010;

my $string = <<'HERE';
[INFO]
xyz
[INFO]
HERE

my( $extracted ) = $string =~ /(?:\A|\R)\[INFO]\R(.*?)\R\[INFO]\R/;

print "Extracted <$extracted>\n";
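To show what "generalized" buys you, here is a small sketch that runs the same pattern against three line-ending conventions. The test strings are built by hand for illustration and are not from the question.

```perl
use 5.010;
use strict;
use warnings;

# Same pattern as above, compiled once.
my $pattern = qr/(?:\A|\R)\[INFO]\R(.*?)\R\[INFO]\R/;

my %results;
for my $case ( [ LF => "\n" ], [ CRLF => "\r\n" ], [ CR => "\r" ] ) {
    my ( $name, $eol ) = @$case;

    # Build "[INFO]<eol>xyz<eol>[INFO]<eol>" with the chosen line ending.
    my $string = join $eol, '[INFO]', 'xyz', '[INFO]', '';

    my ($extracted) = $string =~ $pattern;
    $results{$name} = $extracted;
    print "$name: Extracted <$extracted>\n";
}
```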

Don't get hung up on the end-of-line anchor.

brian d foy
Great explanation, couldn't understand flip-flop's behaviour with regexes from perlfaq6 or perlop
Zaid
(+1) Very cool! But, two problems: it is Perl6 specific. :-( And, it is yet another work around (I already devised a work around of my own in the post). At this point I'm really just asking how to get the end-of-line anchors to work in mid regex. Thanks for the education though.
harschware
@harschware: Eh? Perl6-specific? This feature's present in Perl...
Zaid
It's not only Perl 6, it's in the Perl 5 FAQ. It's not a workaround either. It's a straightforward way to extract text between two lines.
brian d foy
Closing square brackets in the regexp need to be backslashed
Zaid
You don't need to escape the closing square brace. It's not special when there isn't a special opening square brace to start a character class.
brian d foy
The things you learn everyday... Thanks!
Zaid
I guess I was mistaken about it being perl6 only, sorry. When I said workaround I meant that it isn't addressing the dual role of $ as EOL and causing interpolation in a pattern, which is really what the question is about. But it is a quite clever bit of code.
harschware
Well, don't force things to do what they have a tough time doing. Use tools that do the job naturally. If regexes are giving you pain, that's a sign they might not be the right tool.
brian d foy
Once again, you can see I already have an alternate solution, and you've provided two others. This bit of code can be written in any number of ways; that's not the point. Identifying that there is a dual-nature issue with $ here, and wondering what Perl provides to solve it, is the point. I think what I'm finding is that there isn't anything provided. I kind of thought qr// was supposed to do the trick and I don't know why it doesn't.
harschware
+1  A: 

Maybe the /x modifier can help:

m/ ^\[INFO\] $ # Match INFO line
   \n
   ^ (.*?) $ # Collect desired line
   \n 
   ^ \[INFO\] # Match another INFO line
/xms

I haven't tested that, so you'd probably have to debug it. But I think this will prevent the $ symbols from interpolating as variables.

Ryan Thompson
A: 

Although I've accepted Alan Moore's answer (Ryan Thompson's answer would also have done the trick; too bad I could only accept one), I wanted to make the solution perfectly clear, as it was kind of buried in the comments and discussion. The following Perl script demonstrates that Perl treats the $ as the start of a variable to interpolate when certain characters follow the dollar sign, and that turning off interpolation allows the $ to be treated as EOL.

use strict;
use warnings;

my $x = "[INFO]\nxyz\n[INFO]";
if( $x =~ /^\[INFO\]$\n(.*?)$\n\[INFO\]/m ) {
    print "'$1' FOUND\n";
} else {
    print "NO MATCH FOUND\n";
}

if( $x =~ m'^\[INFO\]$\n(.*?)$\n\[INFO\]'m ) {
    print "'$1' FOUND\n";
} else {
    print "NO MATCH FOUND\n";
}

if( $x =~ m/ ^\[INFO\] $ # Match INFO line
\n
^ (.*?) $ # Collect desired line
\n 
^ \[INFO\] # Match another INFO line
/xms ) {
    print "'$1' FOUND\n";
} else {
    print "NO MATCH FOUND\n";
}

The script produces the following output:

Use of uninitialized value $\ in regexp compilation at t.pl line 5.
Use of uninitialized value $\ in regexp compilation at t.pl line 5.
NO MATCH FOUND
'xyz' FOUND
'xyz' FOUND
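On the qr// point raised in the comments: qr// interpolates just like m// does, but qr also accepts single-quote delimiters, which switch interpolation off. A minimal sketch, assuming the same $x as in the script above (the name `$between` is made up for illustration):

```perl
use strict;
use warnings;

my $x = "[INFO]\nxyz\n[INFO]";

# qr with single-quote delimiters: nothing is interpolated, so each $
# reaches the regex engine as an end-of-line anchor.
my $between = qr'^\[INFO\]$\n(.*?)$\n\[INFO\]'m;

my $found = $x =~ $between ? $1 : undef;
print defined($found) ? "'$found' FOUND\n" : "NO MATCH FOUND\n";
```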
harschware
Those warnings would seem to indicate that my answer is wrong.
Ryan Thompson
No, they come from the first regex, the one I posted in the question (which produces warnings and no match). The second and third regexes match without warnings.
harschware