This is difficult to describe, but it should be useful for extracting data from the output I am dealing with (I hope to use this code for a large number of purposes).

Here is an example. Say I have a text file with words and some special characters ($, #, !, etc.) that reads:


blah blah
blah add this word to the list: 1234.56 blah blah
blah blah
blah now don't forget to add this word to the list: PINAPPLE blah blah
And for bonus points,
it would be nice to know that the script
would be able to add this word to the list: 1!@#$%^&*()[]{};:'",<.>/?asdf blah blah
blah blah


As the example implies, I would like to add whatever "word" (defined in this context as any string that does not contain spaces) to some form of list, such that I can extract elements of the list as list[2], list[3] or list(4), list(5), or something along those lines.

This would be very versatile, and after some questioning in another thread and another forum, I am hoping that doing it in Perl will make it relatively fast in execution, so it will work well even for large text files. I intend to use this to read data from output files generated by different programs, regardless of the structure of the output file; i.e., if I know the string to search for, I can get the data.

+2  A: 

I think there are some missing words in your question :) But this sounds like what you want (assuming even the "large text files" fit in memory; if not, you'd loop through line by line, pushing onto @list instead).

use File::Slurp;

my $filecontents = File::Slurp::read_file("filename");
my @list = $filecontents =~ /add this word to the list: (\S+)/g;
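
A minimal line-by-line sketch of that fallback for files that don't fit in memory (the filename is again a placeholder):

open my $fh, '<', "filename" or die "can't open: $!";
my @list;
while (my $line = <$fh>) {
    # list-context match with /g collects every capture on the line
    push @list, $line =~ /add this word to the list: (\S+)/g;
}
close $fh;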
ysth
A: 

How about:

my(@list);
my $rx = qr/.*add this word to the list: +(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1;
          s/$rx//;
     }
}

This allows for long lines containing more than one of the 'add' markers. If there can definitely be only one, replace the inner while with if. (Except, of course, that I used a greedy '.*', which snaffles up everything to the last occurrence of the marker; the non-greedy version below fixes that.)

my(@list);
my $rx = qr/(?:.*?)add this word to the list: +(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1;
          s/$rx//;
     }
}

With a selectable marker:

my $marker = "add this word to the list:";
my(@list);
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1;
          s/$rx//;
     }
}

With no repeats:

my $marker = "add this word to the list:";
my(%hash);
my(@list);
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1 unless defined $hash{$1};
          $hash{$1} = 1;
          s/$rx//;
     }
}

Etc.


And, as @ysth points out, you (I) don't need the substitution; Perl DWIMs correctly with a g-qualified match in the inner loop:

#!/bin/perl -w
use strict;
my(@list);
my(%hash);
my($marker) = "add this word to the list:";
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
    while (m/$rx/g)
    {
        push @list, $1 unless defined $hash{$1};
        $hash{$1} = 1;
    }
}

foreach my $i (@list)
{
    print "$i\n";
}
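
Run against the sample input from the question (saved as, say, input.txt, with the script as extract.pl; both names are just examples), this should print:

$ perl extract.pl input.txt
1234.56
PINAPPLE
1!@#$%^&*()[]{};:'",<.>/?asdf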
Jonathan Leffler
Hold on. I should clarify in a comment under the question.
Feynman
Where is the file name declared?
Feynman
@Feynman: nowhere - or on the command line. The '`while (<>)`' notation reads each line from each file specified on the command line, or from standard input if there are no such files. If you need to read from a file, open the file and read from the file handle instead of the '<>' file handle.
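A minimal sketch of that, with a hypothetical file name:

open my $fh, '<', 'output.txt' or die "can't open output.txt: $!";
while (<$fh>)
{
    # ... same matching loop as above, reading from $fh instead of <> ...
}
close $fh;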
Jonathan Leffler
No need for the substitution; using `while (m/$rx/g)` will loop through the available matches
ysth
A: 

If the search string is always the same, let Perl do the record splitting by using the search phrase as the input record separator:

open my $fh, '<', 'test.dat' or die "can't open $!"; # usual way of opening a file

my @list;                                            # declare empty array 'list' (results)
$/ = 'add this word to the list:';                   # define custom input record separator

while( <$fh> ) {                                     # read records one by one
   push @list, $1 if /(\S+)/;
}
close $fh;                                           # that's it, close the file!

print join "\n", @list;                              # this will list the results

The above is "almost OK": it saves the first word of the file in $list[0], because the text before the first marker becomes the first record. But this way makes the code very easy to comprehend (IMHO):

blah                 <== first word of the file
1234.56
PINAPPLE
1!@#$%^&*()[]{};:'",<.>/?asdf
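
If that leading word is unwanted, one possible fix (just a sketch) is to discard the first record, which holds everything up to and including the first marker:

open my $fh, '<', 'test.dat' or die "can't open $!";
$/ = 'add this word to the list:';
<$fh>;                               # read and discard the text before the first marker
my @list;
while( <$fh> ) {
   push @list, $1 if /(\S+)/;
}
close $fh;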

Q: Why not simply look the strings up with one regex over the entire data, as has already been suggested here? Because in my experience, record-wise processing with a per-record regular expression (probably a very complicated regex in a real use case) will be faster, especially on very large files. That's the reason.


Real world test

To back this claim up, I performed some tests with a 200MB data file containing 10,000 of your markers. The test source follows:

use strict;
use warnings;
use Benchmark qw(timethese cmpthese);
use File::Slurp;
# 'data.dat', a 200MB data file, containing 10_000
# markers: 'add this word to the list:', each followed
# by one of several different data items.

my $t = timethese(10,
 {
  'readline+regex' => sub { # trivial reading line-by-line
                     open my $fh, '<', 'data.dat' or die "can't open $!"; 
                     my @list;                                            
                     while(<$fh>) { 
                        push @list,$1 if /add this word to the list:\s*(\S+)/
                     }
                     close $fh;                                           
                     return scalar @list;   
                  },
  'readIRS+regex' => sub { # treat each 'marker' as start of an input record
                     open my $fh, '<', 'data.dat' or die "can't open $!"; 
local $/ = 'add this word to the list:'; # new IRS, local'ized so the other tests are unaffected
                     my @list;                                            
                     while(<$fh>) { push @list, $1 if /(\S+)/ }       
                     close $fh;                                           
                     return scalar @list;   
                  },
  'slurp+regex' => sub { # read the whole file and apply regular expression
                     my $filecontents = File::Slurp::read_file('data.dat');
                     my @list = $filecontents =~ /add this word to the list:\s*(\S+)/g;
                     return scalar @list;
                  },
 }
);
cmpthese( $t ) ;

which outputs the following timing results:

Benchmark: timing 10 iterations of readIRS+regex, readline+regex, slurp+regex...
readIRS+regex: 43 wallclock secs (37.11 usr +  5.48 sys = 42.59 CPU) @  0.23/s (n=10)
readline+regex: 42 wallclock secs (36.47 usr +  5.49 sys = 41.96 CPU) @  0.24/s (n=10)
slurp+regex: 142 wallclock secs (135.85 usr +  4.98 sys = 140.82 CPU) @  0.07/s (n=10)
               s/iter    slurp+regex  readIRS+regex readline+regex
slurp+regex      14.1             --           -70%           -70%
readIRS+regex    4.26           231%             --            -1%
readline+regex   4.20           236%             1%             --

which basically means that the simple line-wise reading and the block-wise reading with a custom IRS are both about 3.3 times as fast (one pass in ~4 s instead of ~14 s) as slurping the whole file and scanning it with a single regular expression.

This basically says that if you are processing files of this size on a system like mine ;-), you should read line by line if your search problem is confined to single lines, and read with a custom input record separator if the search problem spans more than one line (my $0.02).

Want to run the test yourself? This script:

use strict;
use warnings;

# build up $n characters of printable noise, inserting a newline
# roughly every tenth character, then repeat the chunk ten times
sub getsomerandomtext {
    my ($s, $n) = ('', (shift));
    while($n --> 0) {                  # "goes toward zero": $n-- > 0
        $s .= chr( rand(80) + 30 );    # characters in the range 30..109
        $s .= "\n" if rand($n) < $n/10;
    }
    $s x 10                            # implicit return value
}

my @stuff = (
 q{1234.56}, q{PINEAPPLE}, q{1!@#$%^&*()[]{};:'",<.>/?asdf}
);

my $fn = 'data.dat';
open my $fh, '>', $fn or die $!;

my $phrase='add this word to the list:';
my $x = 10000;

while($x --> 0) {
   print $fh
      getsomerandomtext(1000),  ' ',
      $phrase, ' ', $stuff[int(rand(@stuff))],  ' ',
      getsomerandomtext(1000), "\n",
}

close $fh;
print "done.\n";

creates the 200MB input file 'data.dat'.
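
To reproduce the comparison (the script names are just examples):

$ perl gendata.pl    # writes the ~200MB data.dat
$ perl bench.pl      # runs the timethese/cmpthese comparison above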

Regards

rbo

rubber boots
Wow. Thank you for the work and the explanation. I never expected anyone to do a benchmark! And you even explained all the details of the result, for which I am very grateful because I am not very familiar with Perl. I am using these scripts in a larger program that reads the output of quantum chemistry packages and formats it into input for further runs to determine more information. I do not think I will be working with 200MB files at first, but then again, I am hoping to release this as an open-source program, and anyone dealing with proteins will most certainly be dealing with files of this size.
Feynman
Actually, I have a quick question. Will the custom IRS search, or any of those methods, be automatically parallelized if I have more than one CPU? I do not have much experience with parallelization, but I do have access to more than one CPU.
Feynman
@Feynman: in order to utilize multiple CPU cores for one problem, you have to create a **multithreaded** program (see: http://perldoc.perl.org/perlthrtut.html and scroll down to examples). But I simply would not do that. That will overcomplicate your solution, introduce hidden problem spots and won't help much because you are **mainly limited by filesystem I/O**. How large are your to-be-processed files anyway? How complicated are the search problems?
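For reference only, a sketch of what the threaded approach might look like, with one worker per hypothetical input file (per the above, I would not expect it to pay off):

use threads;

my @files = ('out1.dat', 'out2.dat');    # hypothetical input files, one per thread
my @workers = map {
    threads->create( sub {
        my ($file) = @_;
        my @found;
        open my $fh, '<', $file or die "can't open $file: $!";
        while (<$fh>) {
            push @found, $1 while /add this word to the list:\s*(\S+)/g;
        }
        close $fh;
        return @found;                   # join() collects this in list context
    }, $_ )
} @files;

my @list = map { $_->join } @workers;    # gather results from all threads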
rubber boots
The search problems should never be very complicated. The files are organized into blocks of data; they should not exceed a few hundred lines and will most likely be on the order of dozens of lines. One problem I would like to watch out for is the possibility of having dozens of such files. That still may not seem like much to someone who uses 200MB files, but here is the catch: say the search is executed to pull out a few energy levels for a few atoms, and say I want a graph. Then I have to repeat this whole process hundreds (or thousands) of times over.
Feynman
I certainly overestimated the size of these "protein databank" files, which I thought for sure would be at least on the order of megabytes. However, they seem to be interconnected (they refer to each other); I should have foreseen this. Trying to develop the ability to work with those files may be a whole new can of worms, and working with them is not my primary concern (I just thought the feature would be nice if/when I get the code on the net). My main task is dealing with small files (relative to 200MB) with small numbers of matches, as efficiently as possible.
Feynman
@Feynman: for Perl, there are already a lot of modules available (http://www.perlmol.org/) that deal with chemistry formats. Can you give some specific information regarding your problem (what file formats exactly, what to do with them, etc.)? What programming language would you normally use? (See also: http://search.cpan.org/~itub/Chemistry-File-QChemOut-0.10/QChemOut.pm)
rubber boots
Wow, I did not know that! Specifically, I am trying to just get the total energy from a GAMESS calculation. DOES PERL HAVE A BUILT-IN FUNCTION FOR THAT??? Or things like that? I searched pretty hard for stuff like that before I set out to do it myself. I had never heard of PerlMol. I will be spending quite some time on that site you sent me if it has the sort of things I am looking for. Thank you very much for showing me!
Feynman
@Feynman: Perl itself doesn't contain such specific stuff, but with GAMESS there comes at least one tool written in Perl: http://www.ualberta.ca/dept/chemistry/computational/gamess_0703r5/tools/globop_extract - maybe that's useful already? Here is the description: http://phoenix.liu.edu/~nmatsuna/gamess/input/GLOBOP.html
rubber boots
Yes, that is helpful as example code. So would you suggest building a script on top of PerlMol, or starting from scratch as I have been doing? It seems that PerlMol can find coordinate-related properties but nothing else, as far as I can tell.
Feynman