If the search string is always the same, let Perl do the work for you by using the search phrase as the input record separator:
use strict;
use warnings;

open my $fh, '<', 'test.dat' or die "can't open: $!";  # usual way of opening a file
my @list;                               # empty array for the results
$/ = 'add this word to the list:';      # custom input record separator
while ( <$fh> ) {                       # read marker-delimited records one by one
    push @list, $1 if /(\S+)/;          # keep the first word of every record
}
close $fh;                              # that's it, close the file
print join "\n", @list;                 # print the results
The above is "almost OK": it will also store the first word of the file in $list[0], because the first record is everything up to (and including) the first marker. But this approach keeps the processing very easy to follow (IMHO):
blah <== first word of the file
1234.56
PINEAPPLE
1!@#$%^&*()[]{};:'",<.>/?asdf
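If that stray first word is unwanted, one way to get rid of it (just a small sketch on top of the loop above) is to skip record number one, because $. counts the marker-delimited records and record 1 is only the text before the first marker:

while ( <$fh> ) {
    next if $. == 1;            # record 1 = everything before the first marker
    push @list, $1 if /(\S+)/;
}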
Q: Why not simply look the strings up with one regex over the entire data (as has already been suggested here)? Because in my experience, record-wise processing with a per-record regular expression (probably a very complicated regex in a real use case) will be faster, especially on very large files. That's the reason.
Real-world test
To back this claim up, I performed some tests with a 200MB data file containing 10,000 of
your markers. The test source follows:
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);
use File::Slurp;

# 'data.dat', a 200MB data file, containing 10_000
# markers 'add this word to the list:', each followed by
# one of several different data items.

my $t = timethese(10,
    {
        'readline+regex' => sub {   # trivial reading line by line
            open my $fh, '<', 'data.dat' or die "can't open: $!";
            my @list;
            while (<$fh>) {
                push @list, $1 if /add this word to the list:\s*(\S+)/;
            }
            close $fh;
            return scalar @list;
        },
        'readIRS+regex' => sub {    # treat each marker as the start of an input record
            open my $fh, '<', 'data.dat' or die "can't open: $!";
            local $/ = 'add this word to the list:';  # new IRS, localized so it does
                                                      # not leak into the other subs
            my @list;
            while (<$fh>) { push @list, $1 if /(\S+)/ }
            close $fh;
            return scalar @list;
        },
        'slurp+regex' => sub {      # read the whole file and apply the regex to it
            my $filecontents = File::Slurp::read_file('data.dat');
            my @list = $filecontents =~ /add this word to the list:\s*(\S+)/g;
            return scalar @list;
        },
    }
);

cmpthese( $t );
which outputs the following timing results:
Benchmark: timing 10 iterations of readIRS+regex, readline+regex, slurp+regex...
readIRS+regex: 43 wallclock secs (37.11 usr + 5.48 sys = 42.59 CPU) @ 0.23/s (n=10)
readline+regex: 42 wallclock secs (36.47 usr + 5.49 sys = 41.96 CPU) @ 0.24/s (n=10)
slurp+regex: 142 wallclock secs (135.85 usr + 4.98 sys = 140.82 CPU) @ 0.07/s (n=10)
s/iter slurp+regex readIRS+regex readline+regex
slurp+regex 14.1 -- -70% -70%
readIRS+regex 4.26 231% -- -1%
readline+regex 4.20 236% 1% --
which basically means that the simple line-wise reading and the record-wise reading by custom IRS
are roughly 3.3 times as fast (one pass in about 4 seconds instead of 14) as slurping the whole file
and scanning it with a single regular expression.
This basically says that if you are processing files of this size on a system like mine ;-),
you should read line by line if your search pattern sits on a single line, and read
by custom input record separator if it spans more than one line (my $0.02).
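For the multi-line case, here is a minimal sketch of what I mean (the file 'entries.log', the 'BEGIN ENTRY' marker and the id/status fields are made up for illustration): every marker starts a new record, and inside one record the regex may cross line boundaries freely.

use strict;
use warnings;

open my $fh, '<', 'entries.log' or die "can't open: $!";
local $/ = 'BEGIN ENTRY';          # one record per entry, however many lines it spans
my @ids;
while ( my $record = <$fh> ) {
    # /s lets . match newlines, so the pattern may span lines within the record
    push @ids, $1 if $record =~ /id:\s*(\d+).*?status:\s*ok/s;
}
close $fh;
print "found ", scalar @ids, " matching entries\n";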
Want to run the test yourself? This script:
use strict;
use warnings;

# build a chunk of random printable text with occasional newlines
sub getsomerandomtext {
    my ($s, $n) = ('', shift);
    while ($n-- > 0) {
        $s .= chr( rand(80) + 30 );
        $s .= "\n" if rand($n) < $n / 10;
    }
    return $s x 10;
}

my @stuff = (
    q{1234.56}, q{PINEAPPLE}, q{1!@#$%^&*()[]{};:'",<.>/?asdf}
);

my $fn     = 'data.dat';
my $phrase = 'add this word to the list:';

open my $fh, '>', $fn or die $!;
my $x = 10_000;
while ($x-- > 0) {
    print $fh
        getsomerandomtext(1000), ' ',
        $phrase, ' ', $stuff[ int(rand(@stuff)) ], ' ',
        getsomerandomtext(1000), "\n";
}
close $fh;
print "done.\n";
creates the 200MB input file 'data.dat'.
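If you want a quick sanity check that the generator really wrote the 10,000 markers, here is a small sketch using the same IRS trick (assuming, as here, that some text follows the last marker):

use strict;
use warnings;

open my $fh, '<', 'data.dat' or die "can't open: $!";
local $/ = 'add this word to the list:';
my $records = 0;
$records++ while <$fh>;
close $fh;
# each marker ends one record; the trailing text after the last marker adds one more
print "markers found: ", $records - 1, "\n";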
Regards
rbo