Is there any way I could optimize the following script to run faster?

foreach my $arg (@data){ #  
   @score=();
   `program $arg $arg1 > $result`; #!!! $arg1 is a very large file with lots of data!!!
   open(FH,$result);
   while(<FH>){
      chomp;
      if($_ =~ /\d+.+\s+(\d+\.\d+|\d+\.|\.\d+).+/){ #here i'm looking for any number such as: 21.343 or 12 or 0.22 or -3.0
         push(@score, $1);
      }
   }
   close FH;
   @sorted = sort{$a <=> $b} @score; #a sorted score is what i actually want
}
+2  A: 

Why couldn't you simply run the program and pipe the results to your perl script?

./program $arg $arg1 | myscript

Actually, you could probably get rid of the perl entirely:

./program $arg $arg1 | grep -E '[0-9]...whatever...' | sort -n
chris
Pipe to grep is how I'd do it too. On Unix. Maybe he's on Windows and all he has installed is Perl and he doesn't want to install Cygwin.
Zan Lynx
The other problem is that this is obviously a snippet of a larger program (see the @data array). I really wanted to write a bash example that does that loop, but who knows what he has going on in perl.
Mike Axiak
@Mike: Well, the first option would take care of the larger program problem. Guess we should have asked "how slow". :)
chris
Even if it's part of a larger program, my @lines = `program $arg $arg1 | grep ...` would still cut out the middleman of having to write out a file and read it back in.
Schwern
`grep` from http://gnuwin32.sf.net/packages.html is an alternative to Cygwin.
daxim
+5  A: 

There are a few things I can see (for instance, reading the program's output directly instead of writing it to a file first), but I suspect the main performance benefit will come from using a different regex. To that end, can you say more precisely what the output format of your program is?

Here's some sample perl that may run a little bit quicker:

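use strict;
use warnings;
# @data and $arg1 come from the surrounding program.
foreach my $arg (@data){
  my @score=();
  # The trailing pipe runs the command and lets us read its output directly.
  open(my $fh, "program $arg $arg1 |") or die "Cannot run program: $!";
  while (<$fh>) {
    chomp;
    if (/\d+.+\s+((\d+)?\.?\d+)/) {
      push(@score, $1);
    }
  }
  close($fh);
  my @sorted = sort { $a <=> $b } @score;
}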

Notice a few things here:

  1. I'm reading from a piped filehandle rather than a temporary file, thus skipping a whole pass over the data.
  2. I changed the regex to use nested groups rather than multiple alternatives.
  3. I use strict and declare my variables with my (for the love of God use strict in your perl).
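
If the score turns out to be the last whitespace-separated field on each line (an assumption; the question does not show the output format), the greedy `.+` can be avoided altogether by splitting the line and validating only the final field. A minimal sketch, where `extract_score` is a hypothetical helper name and the number pattern also accepts a leading minus sign:

```perl
use strict;
use warnings;

# Assumption: a score looks like 21.343, 12, 12., .22 or -3.0
my $number = qr/^-?(?:\d+\.?\d*|\.\d+)$/;

# Return the last field of the line if it is a number, otherwise nothing.
sub extract_score {
    my ($line) = @_;
    my @fields = split ' ', $line;      # awk-style split, ignores the newline
    return unless @fields >= 2;         # expect an identifier plus a score
    return unless $fields[-1] =~ $number;
    return $fields[-1];
}

my @score = map { extract_score($_) }
    ("id1 foo 21.343\n", "id2 bar -3.0\n", "no score here\n", "id3 baz .5\n");
my @sorted = sort { $a <=> $b } @score;
print "@sorted\n";
```

Because the helper returns an empty list for lines without a score, it can be used directly inside `push @score, extract_score($_)` in the read loop.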

Other people have suggested using threads. You DO NOT need to do this: running the process as I have done, with the trailing pipe (|) in the open function, causes perl to fork a process for you. You then read from the program asynchronously through a standard Unix pipe.

Mike Axiak
I think you're not understanding the thread recommendation. If he turns the `foreach my $arg (@data)` loop into a bunch of threads, he can run `program` two or more times in parallel, thus potentially speeding up his program that way. Putting a pipe in the open function doesn't do this. (As far as I know, and it would be incredible to have that happen.)
CanSpice
Ah I did misunderstand that, thanks :-)
Mike Axiak
A: 

Yup, first of all: redirecting the program's output to a file and reading it back afterwards is stupid & expensive. Why not just do this?

my @result = `program $arg $arg1`;
foreach(@result) {...

The second thing is that you can parallelize the outer foreach. See perldoc threads and perldoc threads::shared.
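
As a rough illustration of running the external program several times at once, here is a sketch that uses plain child processes via pipe opens instead of the threads module (a deliberate swap: not every perl is built with thread support). The names `command_for`, `start_all` and `collect_sorted` are hypothetical, the stand-in command just prints one fake score per argument, and the regex assumes the score is the last field on the line.

```perl
use strict;
use warnings;

# Hypothetical stand-in for `program $arg $arg1`; replace with the real command.
sub command_for {
    my ($arg) = @_;
    return ($^X, '-e', 'print "row $ARGV[0] ", $ARGV[0] * 1.5, "\n"', $arg);
}

# Start one child per argument; all children run concurrently.
sub start_all {
    my @args = @_;
    my %fh_for;
    for my $arg (@args) {
        # List-form pipe open forks and execs the command without a shell.
        open(my $fh, '-|', command_for($arg))
            or die "Cannot run command for $arg: $!";
        $fh_for{$arg} = $fh;
    }
    return %fh_for;
}

# Read each child's output in turn and keep a numerically sorted score list.
sub collect_sorted {
    my (%fh_for) = @_;
    my %sorted_for;
    for my $arg (sort keys %fh_for) {
        my $fh = $fh_for{$arg};
        my @score;
        while (my $line = <$fh>) {
            push @score, $1 if $line =~ /\s(-?\d+(?:\.\d+)?)\s*$/;
        }
        close($fh) or warn "Command for $arg exited abnormally: $?";
        $sorted_for{$arg} = [ sort { $a <=> $b } @score ];
    }
    return %sorted_for;
}

my %sorted_for = collect_sorted(start_all(1, 2, 3));
print "$_: @{ $sorted_for{$_} }\n" for sort keys %sorted_for;
```

This starts every child at once, so with a long @data list you would probably want to launch them in batches sized to the number of CPU cores. Whether it helps at all depends on whether the external program is CPU-bound or limited by reading the large input file.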

hlynur
-1 because he says the result is a very large file. Reading it into a Perl list will likely overflow his RAM.
Zan Lynx
He says $arg1 is a very large file. He didn't say the *program* output is very large.
hlynur
This is why perl allows you to use pipe in open(): http://perldoc.perl.org/perlipc.html#Using-open()-for-IPC
Mike Axiak
That program you provided doesn't even work the way you think it does. Please delete this answer.
Brad Gilbert
@Brad Gilbert nnaah... I'll leave it as it is.
hlynur
``my @result = split /\n/, `program $arg $arg1`;``
Brad Gilbert
Sure, if you mind the trailing newlines.
hlynur
+2  A: 

Have you profiled your program? Without profiling, you don't know if the vast majority of the time is spent in the external program or in your program.

Profiling is an important step in optimization; without it, you're essentially guessing where speed improvements can be made. Profiling will show you which steps take the most time.

That said, as hlynur said, you could probably parallelize your external program calls using threads. You might also gain some optimizations through a different regular expression, but there's no real way to tell how much you'll gain without profiling first.
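
Before reaching for a full profiler such as Devel::NYTProf (`perl -d:NYTProf yourscript.pl`, then `nytprofhtml`), a crude first measurement is to time the external program and the parsing separately with the core Time::HiRes module. The sketch below is only illustrative: the command is a hypothetical stand-in that prints 1000 fake rows, the regex assumes the score is the last field, and the output is slurped into memory purely to separate the two phases.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical stand-in for `program $arg $arg1`; replace with the real command.
my @cmd = ($^X, '-e', 'print "row $_ ", $_ / 4, "\n" for 1 .. 1000');

# Phase 1: run the external program and just drain its output.
my $t0 = time;
open(my $fh, '-|', @cmd) or die "Cannot run command: $!";
my @lines = <$fh>;
close($fh);
my $run_seconds = time - $t0;

# Phase 2: parse and sort the lines that are already in memory.
$t0 = time;
my @score;
for my $line (@lines) {
    push @score, $1 if $line =~ /\s(-?\d+(?:\.\d+)?)\s*$/;
}
my @sorted = sort { $a <=> $b } @score;
my $parse_seconds = time - $t0;

printf "external program: %.4fs, parse and sort: %.4fs (%d scores)\n",
    $run_seconds, $parse_seconds, scalar @sorted;
```

If the first number dominates, tuning the regex will not help much and the effort is better spent on the external program or on running several instances in parallel.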

CanSpice