ansaurus

Question

How can I retrieve the N-th line from a text file using Perl?

Answer 1

+9 A:

If you are reading from a filehandle:

while (my $line = <$fh>) {
    if ($. == 1 or not $. % 10) {
        print $line;
    }
}

If you have a scalar that holds a bunch of lines like:

my $s = join "", map { "$_\n" } "a" .. "z";

Then you can treat the scalar like a file by passing a reference to it during an open:

open my $fh, "<", \$s
    or die "could not open in-memory file: $!";

and then use the solution above.

Putting it all together, you get

#!/usr/bin/perl

use strict;
use warnings;

my $s = join "", map { "$_\n" } "a" .. "z";

open my $fh, "<", \$s
    or die "could not open in-memory file: $!";

while (my $line = <$fh>) {
    if ($. == 1 or not $. % 10) {
        print "$. $line";
    }
}

Note that this trick only works if your perl was built with PerlIO, which has been the default since Perl 5.8. If your perl wasn't compiled with PerlIO, you will need to grab IO::Scalar from CPAN.
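If you only need a single line rather than every tenth one, the same `$.` check can stop reading as soon as the line is found. This is a minimal sketch, not part of the original answer; `nth_line` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return line number $n (1-based) from an open
# filehandle, or undef if the input has fewer than $n lines.
sub nth_line {
    my ($fh, $n) = @_;
    while (my $line = <$fh>) {
        return $line if $. == $n;   # $. is the current input line number
    }
    return;
}

my $s = join "", map { "$_\n" } "a" .. "z";

open my $fh, "<", \$s
    or die "could not open in-memory file: $!";

my $tenth = nth_line($fh, 10);
close $fh;   # closing the handle resets $.

print $tenth;   # prints "j\n"
```

Returning from inside the loop means the rest of the input is never read, which matters when the data is large and the wanted line is near the top.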

For truly insane levels of weirdness, you could use Tie::File on the in-memory file:

#!/usr/bin/perl

use strict;
use warnings;

use Tie::File;

my $s = join "", map { "$_\n" } "a" .. "z";

open my $fh, "<", \$s
    or die "could not open in-memory file: $!";

tie my @lines, "Tie::File", $fh
    or die "could not tie in-memory file: $!";

my $i = 0;
while (defined $lines[$i]) {
    print "$lines[$i]\n";
} continue {
    $i += 10;
}

Chas. Owens 2010-09-10 22:36:42

Clever trick with the scalar ref filehandle.

Schwern 2010-09-10 22:42:34

@Schwern What I like most about it is that if `$s` were huge then splitting it into an array would take forever and even more memory, but by using it as an in-memory file we only have to iterate over it once and only one line's worth of memory is used at a time.

Chas. Owens 2010-09-10 22:45:52

We have some more scalar filehandle tricks like this in _Effective Perl Programming_. There's no more splitting into arrays, etc. Just read a scalar line by line. :)

brian d foy 2010-09-11 01:24:10

Checking again, I see that your first solution is what I think is the right one (lines 1, 10, 20), but the Tie::File example is different, giving (lines 1, 11, 21).

brian d foy 2010-09-12 00:24:00

@brian d foy Yeah, I didn't bother making the insane one correct.

Chas. Owens 2010-09-12 01:08:24

Answer 2

+5 A:

Here's how you'd do it with a regex taking advantage of the /g modifier.

my $count = 0;
my @found;
while($text =~ /\G(.*)\n/g) {
    next if $count++ % 10 != 0;

    push @found, $1;
}

I bench it at about 50% faster than Chas' scalar ref filehandle solution for small strings of less than 100 lines, but at 1000 lines and up it levels off to just 20% faster.

Chas' filehandle solution is safer (if you write the regex wrong you can end up with an infinite loop), simpler, and neither significantly slower nor more memory-hungry. Use that.

Schwern 2010-09-10 22:56:19

Can you post your benchmark script? I wonder why my results are so different from yours.

brian d foy 2010-09-11 02:19:40

@brian d foy I think the discrepancy is related to your benchmarking my insane code at the end vs the solution I proposed (the penultimate code block).

Chas. Owens 2010-09-11 12:17:57

In my revised tests, your sane code comes out faster than Schwern's.

brian d foy 2010-09-11 23:57:50

Answer 3

+3 A:

Here's a benchmark using my solution of a simple filehandle read versus Schwern's regex and Chas.'s tie-ing.

This is Perl 5.12.2 running on my Mac Pro:

                 Rate Chas. Chas. modified drewk Schwern Chas. sane drewk2 brian
Chas.          70.0/s    --           -33%  -94%    -94%       -95%   -95%  -96%
Chas. modified  104/s   48%             --  -91%    -91%       -92%   -93%  -94%
drewk          1163/s 1560%          1019%    --     -5%       -15%   -23%  -35%
Schwern        1220/s 1641%          1073%    5%      --       -11%   -20%  -32%
Chas. sane     1370/s 1856%          1218%   18%     12%         --   -10%  -23%
drewk2         1515/s 2064%          1358%   30%     24%        11%     --  -15%
brian          1786/s 2450%          1618%   54%     46%        30%    18%    --

This is Perl 5.10.1 on the same machine:

                 Rate Chas. Chas. modified drewk Schwern Chas. sane drewk2 brian
Chas.          66.9/s    --           -35%  -94%    -95%       -95%   -96%  -96%
Chas. modified  103/s   54%             --  -91%    -92%       -93%   -93%  -94%
drewk          1111/s 1560%           981%    --    -17%       -22%   -27%  -40%
Schwern        1333/s 1892%          1197%   20%      --        -7%   -12%  -28%
Chas. sane     1429/s 2034%          1290%   29%      7%         --    -6%  -23%
drewk2         1515/s 2164%          1374%   36%     14%         6%     --  -18%
brian          1852/s 2667%          1702%   67%     39%        30%    22%    --

These results don't surprise me that much. Tie::File seems slower than it should be, but I expected it to be slow. It's nifty, but I find Tie::File is often a poor trade-off in performance for a nice interface to something that wasn't that hard to start with. It's nice if you need random and repeated access, but for single-pass sequential access it's the wrong tool. Chas. does a bit more work than I think he really needs in that example. We know the indices of the lines that we want, so we can just take a slice of the tied array. The slice ("Chas. modified") is about 50% faster than the while loop looking at every line, that is, roughly 1.5 times the rate.

To see an extreme result, I replicated the lines 1,000 times (so, about 1,300,000 lines in the file):
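The index arithmetic behind such a slice can be sketched on its own. This example is an illustration, not the benchmarked code: a plain array stands in for the tied array, and the data is made up.

```perl
use strict;
use warnings;

# A plain array stands in for the tied @lines here (an assumption made
# to keep the sketch free of Tie::File).
my @lines = map { "Line $_" } 1 .. 35;

# Line numbers wanted: 1, 10, 20, ... up to the last line available.
my @wanted = (1, map { $_ * 10 } 1 .. int(@lines / 10));

# Array indices are zero-based, so subtract 1 from each line number.
my @found = @lines[ map { $_ - 1 } @wanted ];

print "$_\n" for @found;   # Line 1, Line 10, Line 20, Line 30
```

Only the wanted elements are touched, which is why a slice does less work than a loop that tests every index.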

 $scalar = slurp( $file ) x 1000;

These are the results for the big file on Perl 5.12.2:

                  Rate Chas. Chas. modified drewk drewk2 Schwern Chas. sane brian
Chas.          0.695/s    --           -32%  -91%   -94%    -94%       -95%  -96%
Chas. modified  1.02/s   46%             --  -86%   -91%    -92%       -93%  -94%
drewk           7.38/s  962%           626%    --   -34%    -39%       -47%  -59%
drewk2          11.2/s 1512%          1002%   52%     --     -7%       -19%  -38%
Schwern         12.1/s 1635%          1086%   63%     8%      --       -13%  -33%
Chas. sane      13.9/s 1896%          1264%   88%    24%     15%         --  -23%
brian           18.0/s 2495%          1674%  144%    61%     50%        30%    --

drewk's solutions creating new arrays show their scaling problem now. Since they aren't any simpler than the other solutions and they have this big drawback, there's no reason to do it that way.

Here's my benchmark program. There's a very slight difference in the programs. My solution (and Chas.'s first solution) gets the 1st, 10th, 20th, and so on lines as noted in the question text. The other solutions get the 1st, 11th, 21st and so on lines as noted in the broken code. That doesn't really matter for the benchmark though.

#!perl
use strict;
use warnings;

use File::Slurp qw(slurp);
use Tie::File;
use Benchmark qw(cmpthese);
use vars qw($scalar);

chomp( my $file = `perldoc -l perlfaq5` );
#$file = '/Users/brian/Desktop/lines';
print "file is $file\n";
$scalar = slurp( $file );

cmpthese( 1000, {
    'Chas.'          => \&chas,
    'Schwern'        => \&schwern,
    'brian'          => \&brian,
    'Chas. modified' => \&chas_modified,
    'Chas. sane'     => \&chas_sane,
    'drewk'          => \&drewk,
    'drewk2'         => \&drewk2,
    });

sub drewk {
   my @arr = split(/\n/, $scalar);
   my @found;
   for(my $i=0; $i<=$#arr; $i+=10){
    #  print "drewk[$i] $arr[$i]\n";
      push @found, $arr[$i];
    }
}
sub drewk2 {
   my $i=0;
   my @found;
   foreach(split(/\n/, $scalar)) {
      next if $i++ % 10;
#      print "drewk2[$i] $_\n";
      push @found, $_;
   }
}
sub schwern {
    my $count = 0;
    my @found;
    while($scalar =~ /\G(.*)\n/g) {
        next if $count++ % 10 != 0;
#       print "schwern[$count] $1\n";
        push @found, $1;
        }
    }

sub chas {
    open my $fh, "<", \$scalar;

    tie my @lines, "Tie::File", $fh
        or die "could not tie in-memory file: $!";

    my $i = 0;
    my @found = ();
    while (defined $lines[$i]) {
        # print "chas[$i]: $lines[$i]\n";
        push @found, $lines[$i];
        } continue {
            $i += 10;
        }   
    }

sub chas_modified {
    open my $fh, "<", \$scalar;

    tie my @lines, "Tie::File", $fh
        or die "could not tie in-memory file: $!";

    my $highest_multiple = int( $#lines / 10 ) ;
    my @found = @lines[ map { $_ * 10  - ($_?1:0) } 0 .. $highest_multiple ]; 
    #print join "\n", @found;
    }

sub chas_sane {
    open my $fh, "<", \$scalar;

    my @found;
    while (my $line = <$fh>) {
        if ($. == 1 or not $. % 10) {
            #print "chas_sane[$.] $line";
            push @found, $line;
            }
        }
    }

sub brian {
    open my $fh, '<', \$scalar;
    my @found = scalar <$fh>;
    while( <$fh> ) {
        next if $. % 10;
        #print "brian[$.] $_";
        push @found, $_;
        }
    }

brian d foy 2010-09-11 02:00:51

I would not suggest using `Tie::File`, note that I said it was "truly insane levels of weirdness".

Chas. Owens 2010-09-11 12:09:39

@Chas: `Tie::File` is great for certain modifications of large files safely in situ. Otherwise, there are better tools as you say.

drewk 2010-09-11 16:20:35

Don't forget that you can WRITE to in-memory files too!
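A minimal sketch of writing to an in-memory file, assuming a perl built with PerlIO; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $buffer = "";

# Opening a scalar reference for writing makes print append to $buffer.
open my $out, ">", \$buffer
    or die "could not open in-memory file for writing: $!";

print {$out} "line $_\n" for 1 .. 3;
close $out;

print $buffer;   # line 1, line 2, line 3, one per line
```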

drewk 2010-09-11 16:42:26

+1: I truly learned something new from your post. Just a small niggle: You have an off by 1 error in your logic since $. starts at 1 not 0. Easily seen if you have `$scalar.="Line $_\n" for (0..100);` Delete the initial read from <$fh> and use next if ($.-1) % 10; in your loop logic. Thanks! You have taught me so much Perl over the years...

drewk 2010-09-11 20:39:14

Yes, $. starts at 1. To get the 1st line as noted in the problem, I have the initial read. To get the ones that are multiples of 10, I have the `$. % 10`. Uncomment the print line to see which lines it picks up and you'll see that it's getting the multiples of 10.
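To see which line numbers that logic selects, here is a small sketch that records `$.` instead of the line text. The 35-line sample data is an assumption for illustration, not the benchmark input.

```perl
use strict;
use warnings;

my $scalar = join "", map { "Line $_\n" } 1 .. 35;

open my $fh, "<", \$scalar
    or die "could not open in-memory file: $!";

my @picked;
my $first = <$fh>;          # initial read: line 1, so $. is now 1
push @picked, $.;
while (<$fh>) {
    next if $. % 10;        # skip unless $. is a multiple of 10
    push @picked, $.;
}
close $fh;

print "@picked\n";   # 1 10 20 30
```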

brian d foy 2010-09-11 23:31:25

@Brian: As stated in my post, your file handle solution is *far* better on larger data. I was, however, just trying to get the poster going with the code that he posted. I used his loop! It is completely unclear to me if he wanted the first line then every tenth there after (my interpretation and his loop) or line "0" then the line 9 lines later, then every 10 after that. He did not post enough information to tell. The code I did does reproduce *exactly* the output of what it *seems* he wanted in his post based on his code.

drewk 2010-09-12 05:46:10

The filehandle solution is better for smaller data too. A solution that doesn't scale and performs similarly at small sizes just isn't any good. Getting someone started with code that's bad isn't good either. Show them a better way than what they have so they learn how to do things right. :)

brian d foy 2010-09-12 07:45:50

@brian d foy: Point taken. I think the filehandle solution is slick and it is relatively new to ME. The only potential drawback is the (very small) chance that the local Perl was not compiled with PerlIO or predates it. How do you check if you have PerlIO at runtime (other than perl -V)? Thanks for the very educational discussion. I learned something...

drewk 2010-09-12 19:47:11

You can get many of the compilation settings through the Config module, which exports a hash called %Config. Its keys are the things you see in `perl -V`, so you can check for `$Config{useperlio} eq 'define'`. Also, anything older than 5.8 doesn't exist in my universe, and even that is fading now that 5.14 is around the corner. :)
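A minimal sketch of that runtime check. The fallback message is only a suggestion based on the IO::Scalar note in the first answer; `$has_perlio` is an illustrative name.

```perl
use strict;
use warnings;
use Config;   # core module; exports %Config

# useperlio is "define" when perl was built with PerlIO.
my $has_perlio = defined $Config{useperlio} && $Config{useperlio} eq 'define';

print $has_perlio
    ? "PerlIO is available; in-memory filehandles should work\n"
    : "no PerlIO; consider IO::Scalar from CPAN\n";
```

The comparison uses `eq` because the value is a string; a numeric `==` would treat every non-numeric string as 0 and match regardless of its content.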

brian d foy 2010-09-13 02:16:24

Answer 4

A:

If Schwern's comment is correct that your "list of text" means it's in a $scalar, one way to fix that is with Perl's split. You can then use the code that you have written, thus:

sub drewk {
   my @arr = split(/\n/, $scalar);
   for(my $i=0; $i<=$#arr; $i+=10){
       #print $arr[$i],"\n";
    }
}

Rather than use a C-style loop, you can write a very readable Perl idiom that does the same thing and is also faster:

sub drewk2 {
   my $i=0;
   my @found;
   foreach(split(/\n/, $scalar)) {
      next if $i++ % 10;
      #print "$_\n";
      push @found, $_;
   }
}

Plugging those into brian's benchmark, you get very competitive results:

                 Rate    Chas. Chas. modified Schwern    drewk    brian   drewk2
Chas.          86.1/s       --           -37%    -95%     -95%     -96%     -96%
Chas. modified  136/s      59%             --    -92%     -92%     -93%     -94%
Schwern        1695/s    1869%          1142%      --      -3%     -14%     -22%
drewk          1754/s    1939%          1186%      4%       --     -11%     -19%
brian          1961/s    2178%          1337%     16%      12%       --     -10%
drewk2         2174/s    2426%          1493%     28%      24%      11%       --

(This is on an iMac with a 2.93 GHz Intel Core i7 and Perl 5.10.)

You didn't post the context of code leading up to your posted loop. Perhaps you did something like this:

   $scalar="line 1\nline 2\n ... line n";

   push @arr, $scalar;
   #or
   $arr[0]=$scalar;

thinking that the \n would cause the lines to end up in different array elements? Post context next time...

----Edit:

The original post states: "How can I print 1st, 10th, 20th... lines (not array index) number in a long list of text". If by "long list of text" you mean megabytes or gigabytes, use Brian's or Chas' filehandle approach. It is slick, fast, and the data will not be duplicated in memory. If "long list of text" is something of a size where RAM is plentiful, you can use split, a /\n/g regex, or whatever makes sense to you and the data.

drewk 2010-09-11 06:55:34

The downside to split is that if the scalar was eating a large amount of memory, then the array will eat at least as much. In this case, the scalar eats 51,760 bytes and the array eats 152,824 bytes (sizes found with [`Devel::Size`](http://search.cpan.org/dist/Devel-Size-0.71/lib/Devel/Size.pm)). Now, 200k isn't that much, but if the scalar had held a gigabyte...

Chas. Owens 2010-09-11 12:41:41

@Chas: Yes, fair point. The algorithm needs to make sense from the original data to the end of the processing of it, but we don't know what that is. I think the OPer is just figuring out the difference between the two at this point. I could be wrong...

drewk 2010-09-11 16:18:24

In your first example, you pick up index 0, 10, 20, and so on, which are lines 1, 11, 21, and so on. Since you don't push the results onto an array, you can't compare that to my examples that all do the same thing.

brian d foy 2010-09-11 23:42:34

In your second example, you pick up the first element for $i = 0, and then don't pick up another line until $i = 10, but since you are counting from 0, that's line 11. It's ten elements away from the 1st line, but that's not how I read the problem.

brian d foy 2010-09-11 23:50:32

@brian: based on the limited info posted in the OP your interpretation is fair. I used *his* posted loop and reproduced the output of that; you seemed to focus on what he said in his descriptive words and focused on that. Who knows? The poster has not been back to clarify.

drewk 2010-09-12 06:32:01
