ansaurus

Question

How can I match end-of-line multiple times in a regex without interpolation?

Answer 1

+3 A:

The question is academic--there's no need for the $ anchors in your regex anyway. You should be using \n to match the newlines, because $ is a zero-width anchor: it only matches the position between the linefeed and the character before it (or the end of the string), never the linefeed itself.

EDIT: What I'm trying to say is that you will never need to use $ that way. Any match that spans from one line to the next will have to consume the line separator somehow. Consider your example:

/^\[INFO\]$(.*?)$\[INFO\]/ms

If this did compile, the (.*?) would start out by consuming the first linefeed and keep going until it had matched \nxyz, where the second $ would succeed. But the next character is a linefeed, and the regex is looking for [, so that doesn't work. After backtracking, the (.*?) would reluctantly consume one more character--the second linefeed--but then the $ would fail.

Any time you try to match an EOL with $ and then some more stuff, the first "stuff" you'll have to match will be the linefeed, so why not match that instead? That's why the Perl regex compiler tries to interpret $\ as a variable name in your regex: it makes no sense to have an end-of-line anchor followed by a character that's not a line separator.
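
A minimal sketch of that suggestion, assuming the three-line sample input discussed in the question. Matching the linefeeds explicitly with \n means no $ is ever followed by a backslash, so there is nothing for Perl to interpolate. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Sample input assumed from the question: two [INFO] lines around "xyz".
my $text = "[INFO]\nxyz\n[INFO]\n";

# Consume each line separator with \n; /m lets ^ match at a line start.
my ($captured) = $text =~ /^\[INFO\]\n(.*?)\n\[INFO\]/ms;

print defined $captured ? "Captured <$captured>\n" : "No match\n";
```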

Alan Moore 2010-05-20 18:55:33

Yes, the question is academic. I've edited the post to show that I'm only interested in figuring out how to get $ to function as end-of-line in multiple places within the regex.

harschware 2010-05-20 22:09:42

@harschware: see my expanded answer.

Alan Moore 2010-05-21 06:15:11

I see what you mean, but again: what about the crux of the problem? How about if the regex were /^\[INFO\]$\nxyz/ms, then $\ is interpolated to undef and the regex fails to match. The problem is not how do I get my pattern to match... but how do you use $ as EOL in cases where it is getting interpolated?

harschware 2010-05-21 17:20:28

You *can* disable interpolation by using single-quotes as the regex delimiters, i.e., `m'^\[INFO\]$\nxyz'm`. But my point is that you don't have to. Perl is very clever about determining whether a sequence should be interpolated or not. Notice that it doesn't try to interpolate `$(`, which is also a built-in variable.
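
A short sketch of both points, assuming the same sample input as the question. The first pattern uses single-quote delimiters so that $\ is left alone; the second uses ordinary delimiters and relies on Perl not interpolating a $ that is directly followed by an opening parenthesis. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $text = "[INFO]\nxyz\n[INFO]\n";

# m'...' disables interpolation, so "$\n" reaches the regex engine as an
# end-of-line anchor followed by a linefeed.
my ($quoted) = $text =~ m'^\[INFO\]$\n(.*?)$\n\[INFO\]'m;

# With ordinary delimiters, "$(" is still treated as an anchor followed by
# the start of a group rather than as the built-in variable $(.
my ($grouped) = $text =~ /^\[INFO\]$(?:\n)(.*?)$/m;

print "single-quoted delimiters: <$quoted>\n";
print "anchor before a group:    <$grouped>\n";
```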

Alan Moore 2010-05-22 02:55:11

Actually, your comment about using m'' to specify the regex is the answer I was looking for. I did not see this comment before starting the bounty - I'll accept this answer when the waiting period is up.

harschware 2010-05-31 23:55:43

Answer 2

+3 A:

Based on the answer in perlfaq6, "How can I pull out lines between two patterns that are themselves on different lines?", here's what a one-liner would look like:

perl -0777 -ne 'print $1,"\n" while /\[INFO\]\s*(.*?)\s*\[INFO\]/sg' file.txt

The -0777 switch slurps in the whole file at once.
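
Inside a script, the same slurping effect can be sketched by localizing the input record separator $/ to undef. This example reads from an in-memory filehandle instead of file.txt so that it is self-contained; the sample data and variable names are assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# Stand-in for file.txt, read through an in-memory filehandle.
my $data = "[INFO]\nxyz\n[INFO]\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

# With $/ undefined, a single read returns the entire contents.
my $text = do { local $/; <$fh> };
close $fh;

# Same pattern as the one-liner: collect whatever sits between two tags.
my @found;
push @found, $1 while $text =~ /\[INFO\]\s*(.*?)\s*\[INFO\]/sg;

print "$_\n" for @found;
```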

However, if you're after a subroutine that gives you the flexibility to choose what tag you want to extract, the File::Slurp module makes things a little easier:

use strict;
use warnings;
use File::Slurp qw/slurp/;

sub extract {

    my ( $tag, $fileName ) = @_;
    my $text = slurp $fileName;

    my ($info) = $text =~ /$tag\s*(.*?)\s*$tag/sg;
    return $info;
}

# Usage:
extract ( qr/\[INFO\]/, 'file.txt' );

Zaid 2010-05-20 21:04:35

Answer 3

+3 A:

When regexes get too tricky, they probably are the wrong tool. I might consider using the flip flop operator here. It's false until its lefthand side is true, then stays true until its righthand side is true. That way, you can choose where to start and end the extraction just by looking at individual lines:

my $string = <<'HERE';
[INFO]
xyz
[INFO]
HERE

open my $string_fh, '<', \$string;

while( <$string_fh> )
    {
    next if /\[INFO]/ .. /\[INFO]/;
    chomp;

    print "Extracted <$_>\n";
    }
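
For comparison, here is a sketch of the flip-flop with distinct start and end markers, where the range spans several lines. The BEGIN and END marker names and the sample lines are hypothetical, chosen only to make the range visible.

```perl
use strict;
use warnings;

my @lines = ( 'noise', 'BEGIN', 'abc', 'def', 'END', 'more noise' );

my @extracted;
for my $line (@lines) {
    # True from the BEGIN line through the END line, inclusive.
    if ( $line =~ /^BEGIN$/ .. $line =~ /^END$/ ) {
        next if $line =~ /^(?:BEGIN|END)$/;    # drop the marker lines
        push @extracted, $line;
    }
}

print "Extracted <$_>\n" for @extracted;
```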

If you are using Perl 5.10 or later, you can use the generalized line ending \R in a regex:

use 5.010;

my $string = <<'HERE';
[INFO]
xyz
[INFO]
HERE

my( $extracted ) = $string =~ /(?:\A|\R)\[INFO]\R(.*?)\R\[INFO]\R/;

print "Extracted <$extracted>\n";

Don't get hung up on the end-of-line anchor.

brian d foy 2010-05-21 00:48:22

Great explanation, couldn't understand flip-flop's behaviour with regexes from perlfaq6 or perlop

Zaid 2010-05-21 02:42:28

(+1) Very cool! But, two problems: it is Perl6 specific. :-( And, it is yet another workaround (I already devised a workaround of my own in the post). At this point I'm really just asking how to get the end-of-line anchors to work in mid regex. Thanks for the education though.

harschware 2010-05-21 04:16:58

@harschware: Eh? Perl6-specific? This feature's present in Perl...

Zaid 2010-05-21 05:59:20

It's not only Perl 6, it's in the Perl 5 FAQ. It's not a workaround either. It's a straightforward way to extract text between two lines.

brian d foy 2010-05-21 06:12:22

Closing square brackets in the regexp need to be backslashed

Zaid 2010-05-21 07:34:37

You don't need to escape the closing square brace. It's not special when there isn't a special opening square brace to start a character class.

brian d foy 2010-05-21 08:13:04

The things you learn everyday... Thanks!

Zaid 2010-05-21 10:40:05

I guess I was mistaken about it being perl6 only, sorry. When I said workaround I meant that it isn't addressing the dual role of $ as EOL and causing interpolation in a pattern, which is really what the question is about. But it is quite a clever bit of code.

harschware 2010-05-21 17:26:45

Well, don't force things to do what they have a tough time doing. Use tools that do the job naturally. If regexes are giving you pain, that's a sign they might not be the right tool.

brian d foy 2010-05-22 02:45:40

Once again, you can see I already have an alternate solution, and you've provided two others. This bit of code can be written in any number of ways; that's not the point. Identifying that there is a dual-nature issue with $ here, and wondering what Perl provides to solve it, is the point. I think what I'm finding is that there isn't anything provided. I kind of thought qr// was supposed to do the trick and I don't know why it doesn't.

harschware 2010-05-22 23:11:58

Answer 4

+1 A:

Maybe the /x modifier can help:

m/ ^\[INFO\] $ # Match INFO line
   \n
   ^ (.*?) $ # Collect desired line
   \n 
   ^ \[INFO\] # Match another INFO line
/xms

I haven't tested that, so you'd probably have to debug it. But I think this will prevent the $ symbols from interpolating as variables.

Ryan Thompson 2010-05-31 23:58:34

Answer 5

A:

Although I've accepted Alan Moore's answer (Ryan Thompson's answer would also have done the trick; too bad I could only accept one), I wanted to make the solution perfectly clear, as it was somewhat buried in the comments and discussion. The following Perl script demonstrates that Perl interpolates $\ as a variable when a backslash follows the dollar sign, and that turning off interpolation, or separating the $ from the next character with whitespace under /x, allows the $ to be treated as EOL.

use strict;
use warnings;

my $x = "[INFO]\nxyz\n[INFO]";
if( $x =~ /^\[INFO\]$\n(.*?)$\n\[INFO\]/m ) {
    print "'$1' FOUND\n";
} else {
    print "NO MATCH FOUND\n";
}

if( $x =~ m'^\[INFO\]$\n(.*?)$\n\[INFO\]'m ) {
    print "'$1' FOUND\n";
} else {
    print "NO MATCH FOUND\n";
}

if( $x =~ m/ ^\[INFO\] $ # Match INFO line
\n
^ (.*?) $ # Collect desired line
\n 
^ \[INFO\] # Match another INFO line
/xms ) {
    print "'$1' FOUND\n";
} else {
    print "NO MATCH FOUND\n";
}

The script produces the following output:

Use of uninitialized value $\ in regexp compilation at t.pl line 5.
Use of uninitialized value $\ in regexp compilation at t.pl line 5.
NO MATCH FOUND
'xyz' FOUND
'xyz' FOUND

harschware 2010-06-02 17:44:34

Those warnings would seem to indicate that my answer is wrong.

Ryan Thompson 2010-06-16 22:02:31

No, they come from the first regex, the one I posted in the question (which produces warnings and no match). The second and third regexes match without warnings.

harschware 2010-06-17 01:10:11
