How can I print the 1st, 10th, 20th... lines (by line number, not array index) of a long list of text? Of course, the following doesn't work:
for(my $i=0; $i<=$arr_size; $i+=10){
print $arr[$i],"\n";
}
Many thanks in advance.
If you are reading from a filehandle:
while (my $line = <$fh>) {
if ($. == 1 or not $. % 10) {
print $line;
}
}
If you have a scalar that holds a bunch of lines like:
my $s = join "", map { "$_\n" } "a" .. "z";
Then you can treat the scalar like a file by passing a reference to it to open:
open my $fh, "<", \$s
or die "could not open in-memory file: $!";
and then use the solution above.
Putting it all together, you get
#!/usr/bin/perl
use strict;
use warnings;
my $s = join "", map { "$_\n" } "a" .. "z";
open my $fh, "<", \$s
or die "could not open in-memory file: $!";
while (my $line = <$fh>) {
if ($. == 1 or not $. % 10) {
print "$. $line";
}
}
Note, this trick only works if you have built perl with PerlIO on, but that has been the default since Perl 5.8. You will need to grab IO::Scalar from CPAN if your version of perl wasn't compiled with PerlIO.
For truly insane levels of weirdness, you could use Tie::File on the in-memory file:
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;
my $s = join "", map { "$_\n" } "a" .. "z";
open my $fh, "<", \$s
or die "could not open in-memory file: $!";
tie my @lines, "Tie::File", $fh
or die "could not tie in-memory file: $!";
my $i = 0;
while (defined $lines[$i]) {
print "$lines[$i]\n";
} continue {
$i += 10;
}
Here's how you'd do it with a regex taking advantage of the /g modifier.
my $count = 0;
my @found;
while($text =~ /\G(.*)\n/g) {
next if $count++ % 10 != 0;
push @found, $1;
}
I bench it at about 50% faster than Chas' scalar ref filehandle solution for small strings of less than 100 lines, but at 1000 lines and up it levels off to just 20% faster.
Chas' filehandle solution is safer (if you write the regex wrong you can end up with an infinite loop), simpler, not significantly slower, and doesn't use more memory. Use that.
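For reference, here is a self-contained sketch of the regex loop. The helper name every_tenth_line and the sample text are made up for illustration, and this variant counts line numbers so that it picks lines 1, 10, 20, ... rather than 1, 11, 21, .... The pattern always consumes at least the newline, which is what keeps the loop from spinning in place. A final line without a trailing newline is not matched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): collects lines
# 1, 10, 20, ... of $text using a /g regex in scalar context.
sub every_tenth_line {
    my ($text) = @_;
    my $line_no = 0;
    my @found;
    # Each /g match resumes where the previous one ended, and \G anchors
    # the attempt there, so every iteration consumes exactly one line.
    while ($text =~ /\G(.*)\n/g) {
        $line_no++;
        push @found, $1 if $line_no == 1 or $line_no % 10 == 0;
    }
    return @found;
}

# Sample data, assumed for the demonstration only.
my $text = join "", map { "line $_\n" } 1 .. 25;
print "$_\n" for every_tenth_line($text);
```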
Here's a benchmark using my solution of a simple filehandle read versus Schwern's regex and Chas.'s tie-ing.
This is Perl 5.12.2 running on my Mac Pro:
                  Rate  Chas.  Chas. modified  drewk  Schwern  Chas. sane  drewk2  brian
Chas.           70.0/s     --            -33%   -94%     -94%        -95%    -95%   -96%
Chas. modified   104/s    48%              --   -91%     -91%        -92%    -93%   -94%
drewk           1163/s  1560%           1019%     --      -5%        -15%    -23%   -35%
Schwern         1220/s  1641%           1073%     5%       --        -11%    -20%   -32%
Chas. sane      1370/s  1856%           1218%    18%      12%          --    -10%   -23%
drewk2          1515/s  2064%           1358%    30%      24%         11%      --   -15%
brian           1786/s  2450%           1618%    54%      46%         30%     18%     --
This is Perl 5.10.1 on the same machine:
                  Rate  Chas.  Chas. modified  drewk  Schwern  Chas. sane  drewk2  brian
Chas.           66.9/s     --            -35%   -94%     -95%        -95%    -96%   -96%
Chas. modified   103/s    54%              --   -91%     -92%        -93%    -93%   -94%
drewk           1111/s  1560%            981%     --     -17%        -22%    -27%   -40%
Schwern         1333/s  1892%           1197%    20%       --         -7%    -12%   -28%
Chas. sane      1429/s  2034%           1290%    29%       7%          --     -6%   -23%
drewk2          1515/s  2164%           1374%    36%      14%          6%      --   -18%
brian           1852/s  2667%           1702%    67%      39%         30%     22%     --
These results don't surprise me that much. Tie::File seems slower than it should be, but I expected it to be slow. It's nifty, but I find Tie::File is often a poor trade-off in performance for a nice interface to something that wasn't that hard to start with. It's nice if you need random and repeated access, but for single-pass sequential access it's the wrong tool. Chas. does a bit more work than I think he really needs to in that example. We know the indices of the lines that we want, so we can just take a slice of the tied array. The slice is about 50% faster than the while loop that walks the tied array one index at a time (the "Chas. modified" row versus the "Chas." row).
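To make the slice idea concrete, here is a minimal sketch on an ordinary (untied) array. The helper wanted_indices and the sample data are hypothetical and not part of the benchmark; the helper maps line numbers 1, 10, 20, ... to the 0-based indices 0, 9, 19, ....

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the 0-based indices of lines 1, 10, 20, ...
# for an array holding $line_count lines.
sub wanted_indices {
    my ($line_count) = @_;
    return () if $line_count < 1;
    # Line 1 is index 0; line 10*k is index 10*k - 1.
    return (0, map { $_ * 10 - 1 } 1 .. int($line_count / 10));
}

# Sample data, assumed for the demonstration only.
my @lines = map { "line $_" } 1 .. 35;

# One array slice fetches all the wanted elements at once.
my @found = @lines[ wanted_indices(scalar @lines) ];
print "$_\n" for @found;
```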
To see an extreme result, I replicated the lines 1,000 times (so, about 1,300,000 lines in the file):
$scalar = slurp( $file ) x 1000;
These are the results for the big file on Perl 5.12.2:
                   Rate  Chas.  Chas. modified  drewk  drewk2  Schwern  Chas. sane  brian
Chas.           0.695/s     --            -32%   -91%    -94%     -94%        -95%   -96%
Chas. modified   1.02/s    46%              --   -86%    -91%     -92%        -93%   -94%
drewk            7.38/s   962%            626%     --    -34%     -39%        -47%   -59%
drewk2           11.2/s  1512%           1002%    52%      --      -7%        -19%   -38%
Schwern          12.1/s  1635%           1086%    63%      8%       --        -13%   -33%
Chas. sane       13.9/s  1896%           1264%    88%     24%      15%          --   -23%
brian            18.0/s  2495%           1674%   144%     61%      50%         30%     --
drewk's solutions, which create new arrays, now show their scaling problem. Since they aren't any simpler than the other solutions and they have this big drawback, there's no reason to do it that way.
Here's my benchmark program. There's a very slight difference in the programs. My solution (and Chas.'s first solution) gets the 1st, 10th, 20th, and so on lines as noted in the question text. The other solutions get the 1st, 11th, 21st and so on lines as noted in the broken code. That doesn't really matter for the benchmark though.
#!perl
use strict;
use warnings;
use File::Slurp qw(slurp);
use Tie::File;
use Benchmark qw(cmpthese);
use vars qw($scalar);
chomp( my $file = `perldoc -l perlfaq5` );
#$file = '/Users/brian/Desktop/lines';
print "file is $file\n";
$scalar = slurp( $file );
cmpthese( 1000, {
'Chas.' => \&chas,
'Schwern' => \&schwern,
'brian' => \&brian,
'Chas. modified' => \&chas_modified,
'Chas. sane' => \&chas_sane,
'drewk' => \&drewk,
'drewk2' => \&drewk2,
});
sub drewk {
my @arr = split(/\n/, $scalar);
my @found;
for(my $i=0; $i<=$#arr; $i+=10){
# print "drewk[$i] $arr[$i]\n";
push @found, $arr[$i];
}
}
sub drewk2 {
my $i=0;
my @found;
foreach(split(/\n/, $scalar)) {
next if $i++ % 10;
# print "drewk2[$i] $_\n";
push @found, $_;
}
}
sub schwern {
my $count = 0;
my @found;
while($scalar =~ /\G(.*)\n/g) {
next if $count++ % 10 != 0;
# print "schwern[$count] $1\n";
push @found, $1;
}
}
sub chas {
open my $fh, "<", \$scalar;
tie my @lines, "Tie::File", $fh
or die "could not tie in-memory file: $!";
my $i = 0;
my @found = ();
while (defined $lines[$i]) {
# print "chas[$i]: $lines[$i]\n";
push @found, $lines[$i];
} continue {
$i += 10;
}
}
sub chas_modified {
open my $fh, "<", \$scalar;
tie my @lines, "Tie::File", $fh
or die "could not tie in-memory file: $!";
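# Use the line count (not the last index) so a final line that is an
# exact multiple of 10 is not skipped.
my $highest_multiple = int( @lines / 10 ) ;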
my @found = @lines[ map { $_ * 10 - ($_?1:0) } 0 .. $highest_multiple ];
#print join "\n", @found;
}
sub chas_sane {
open my $fh, "<", \$scalar;
my @found;
while (my $line = <$fh>) {
if ($. == 1 or not $. % 10) {
#print "chas_sane[$.] $line";
push @found, $line; # the loop variable is $line; $_ is not set here
}
}
}
sub brian {
open my $fh, '<', \$scalar;
my @found = scalar <$fh>;
while( <$fh> ) {
next if $. % 10;
#print "brian[$.] $_";
push @found, $_;
}
}
If Schwern's comment is correct that your "list of text" means it's in a $scalar, one way to fix that is with Perl's split. You can then use the code that you have written, thus:
sub drewk {
my @arr = split(/\n/, $scalar);
for(my $i=0; $i<=$#arr; $i+=10){
#print $arr[$i],"\n";
}
}
Rather than use a C-style loop, you can write a very readable Perl idiom that does the same thing and is also faster:
sub drewk2 {
my $i=0;
my @found;
foreach(split(/\n/, $scalar)) {
next if $i++ % 10;
#print "$_\n";
push @found, $_;
}
}
Plugging those into brian's benchmark, you get very competitive results:
                  Rate  Chas.  Chas. modified  Schwern  drewk  brian  drewk2
Chas.           86.1/s     --            -37%     -95%   -95%   -96%    -96%
Chas. modified   136/s    59%              --     -92%   -92%   -93%    -94%
Schwern         1695/s  1869%           1142%       --    -3%   -14%    -22%
drewk           1754/s  1939%           1186%       4%     --   -11%    -19%
brian           1961/s  2178%           1337%      16%    12%     --    -10%
drewk2          2174/s  2426%           1493%      28%    24%    11%      --
(This is on an iMac with a 2.93 GHz Intel Core i7 and Perl 5.10.)
You didn't post the context of code leading up to your posted loop. Perhaps you did something like this:
$scalar="line 1\nline 2\n ... line n";
push @arr, $scalar;
#or
$arr[0]=$scalar;
thinking that the \n would cause the lines to end up in different array elements? Post context next time...
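A small sketch of the difference, using made-up sample data: pushing a multi-line string onto an array stores it as a single element, while split produces one element per line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data, assumed for the demonstration only.
my $scalar = "line 1\nline 2\nline 3";

my @one_element;
push @one_element, $scalar;          # the whole string is a single element

my @per_line = split /\n/, $scalar;  # one element per line

print "pushed: ", scalar(@one_element), " element(s)\n";
print "split:  ", scalar(@per_line), " element(s)\n";
```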
----Edit:
The original post states: How can I print 1st, 10th, 20th... lines (not array index) number in a long list of text.
If by "long list of text" you mean megabytes or gigabytes, use brian's or Chas' filehandle approach. It is slick and fast, and the data will not be duplicated in memory. If "long list of text" is something of a size where RAM is plentiful, you can use split, a /\n/g regex, or whatever else makes sense to you and suits the data.