I am reading a postfix mail log file into an array and then looping through it to extract messages. On the first pass, I'm checking for a match on the "to=" line and grabbing the message ID. After building an array of MSGIDs, I'm looping back through the array to extract information on the to=, from=, and client= lines.

What I'd like to do is remove a line from the array as soon as I've extracted the data from it in order to make the processing a bit faster (i.e. one less line to check against).

Any suggestions? This is in Perl.


Edit: gbacon's answer below was enough to get me rolling with a solid solution. Here's the guts of it:

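my %msg;
while (<>) {
    my $line = $_;    # keep the untouched line for the second match

    # Strip everything up to the queue ID, then collect the key=value fields.
    if (s!^.*postfix/\w+\[.+?\]: (\w+):\s*!!) {
        my $key = $1;
        push @{ $msg{$key}{$1} } => $2
            while /\b(to|from|client|size|nrcpt)=<?(.+?)(?:>|,|\[|$)/g;
    }

    # On the "removed" line, record the timestamp and the server name.
    # \s+ (rather than a single space) also matches space-padded days such as "Apr  8".
    if ($line =~ s!^(\w+\s+\d+\s+\d+:\d+:\d+)\s(\w+.*)\s+postfix/\w+\[.+?\]: (\w+):\s*removed!!) {
        my $key = $3;
        push @{ $msg{$key}{date} } => $1;
        push @{ $msg{$key}{server} } => $2;
    }
}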

use Data::Dumper;
$Data::Dumper::Indent = 1;
print Dumper \%msg;

I'm sure that second regexp can be made more impressive, but it gets the job done for what I need. I can now take the hash of all messages and pull out the ones I'm interested in.
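
For anyone wondering what "pulling out the ones I'm interested in" might look like, here is a rough sketch. The sample hash, the example.com domain, and the $wanted pattern are all made up for illustration; only the shape of %msg (queue ID, then field name, then a list of values) comes from the code above.

```perl
use strict;
use warnings;

# Hypothetical data in the same shape the loop above builds.
my %msg = (
    'AAA111' => { to => ['1234@example.com'], from => ['a@example.org'] },
    'BBB222' => { to => ['bob@example.net'],  from => ['b@example.org'] },
);

# Hypothetical filter: a numeric local part at one particular domain.
my $wanted = qr/^[0-9-]+\@example\.com$/;

# Keep the queue IDs that have at least one matching recipient.
my @interesting = grep {
    my $id = $_;
    grep { $_ =~ $wanted } @{ $msg{$id}{to} || [] };
} sort keys %msg;

for my $id (@interesting) {
    print "$id: ", join(', ', @{ $msg{$id}{from} || [] }), "\n";
}
```

The `|| []` guards against messages that never logged a to= or from= line.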

Thanks to all who answered.

+4  A: 

It won't actually make the processing faster, as removing from the middle of an array is an expensive operation.

Better options:

  • Do everything in one pass
  • When you build the array of IDs, include pointers (indexes, really) into the main array so that you can access its elements quickly for a given ID
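
The second option could look something like the sketch below. It assumes the log lines are already in an array (called @log here, a made-up name) and that the message ID is the token right after the postfix/...[pid]: prefix. The sample lines are invented.

```perl
use strict;
use warnings;

# Hypothetical sample data; a real script would read these lines from the log file.
my @log = (
    'Apr  8 14:22:02 host postfix/smtpd[1]: AAA111: client=mail.example.com[192.0.2.1]',
    'Apr  8 14:22:03 host postfix/qmgr[2]: BBB222: from=<sender@example.com>, size=10',
    'Apr  8 14:22:04 host postfix/smtp[3]: AAA111: to=<rcpt@example.com>, relay=none',
);

# Map each message ID to the indexes of the lines that mention it.
my %lines_for;
for my $i (0 .. $#log) {
    push @{ $lines_for{$1} }, $i
        if $log[$i] =~ m{postfix/\w+\[\d+\]: (\w+):};
}

# Later, visit only the lines that belong to one message.
my @hits = @log[ @{ $lines_for{AAA111} } ];
print "$_\n" for @hits;
```

With the indexes stored per ID, the second pass touches only the relevant lines and nothing ever has to be removed from the array.
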
Eli Bendersky
A: 

In Perl, you can use the splice() function to remove elements from an array.

Be careful when deleting from an array while looping over it: every removal shifts the indexes of the elements that follow.

Ken Aspeslagh
A: 

Assuming you have the index at hand, use splice:

splice(@array, $indextoremove, 1)

But be careful: once you remove an element, every later element moves down by one, so any indexes you saved for those elements are no longer valid.
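
One common way around the shifting-index problem is to walk the array from the end toward the start. This is only a sketch with invented data, where 'drop' stands in for "a line I have already processed":

```perl
use strict;
use warnings;

my @array = qw(keep drop keep drop drop keep);

# Walk from the last index down to 0 so that removing an element
# never shifts the position of an element we have yet to visit.
for my $i (reverse 0 .. $#array) {
    splice(@array, $i, 1) if $array[$i] eq 'drop';
}

print "@array\n";    # keep keep keep
```

Each splice still has to move the elements after it, so this fixes the correctness problem but not the cost.
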

Vivin Paliath
A: 

Common methods for manipulating the contents of an array:

# start over with this list for each example:
my @list = qw(a b c d);

splice:

splice @list, 2, 1, qw(e);
# @list now contains: qw(a b e d)

pop and shift:

pop @list;
# @list now contains: qw(a b c)

shift @list;
# @list now contains: qw(b c d)

map:

@list = map { $_ eq 'b' ? () : $_ } @list;
# list now contains: qw(a c d);

array slices:

@list[3..4] = qw(e f);
# list now contains: qw(a b c e f);

for and foreach loops:

foreach (@list)
{
    # $_ is aliased to each element of the list in turn;
    # assignments will be propagated back to the original structure
    $_ = uc if m/[a-c]/;
}
# list now contains: qw(A B C d);

Read about all these functions at perldoc perlfunc, slices in perldoc perldata, and for loops in perldoc perlsyn.

Ether
+1  A: 

Why not do this:

my @extracted = map  extract_data($_), 
                grep msg_rcpt_to( $rcpt, $_ ), @log_data;

When you are done, you'll have an array of extracted data in the same order it appeared in the log.
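
The snippet above leaves extract_data and msg_rcpt_to undefined. As a rough illustration only, here is one way they might be filled in; both helper bodies, the sample lines, and the address are assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line records a delivery to $rcpt.
sub msg_rcpt_to {
    my ($rcpt, $line) = @_;
    return $line =~ /\bto=<\Q$rcpt\E>/;
}

# Hypothetical helper: return the message ID found on the line, if any.
sub extract_data {
    my ($line) = @_;
    return $line =~ m{postfix/\w+\[\d+\]: (\w+):} ? $1 : ();
}

my $rcpt     = 'rcpt@example.com';
my @log_data = (
    'Apr  8 14:22:04 host postfix/smtp[3]: AAA111: to=<rcpt@example.com>, relay=none',
    'Apr  8 14:22:04 host postfix/smtp[3]: BBB222: to=<other@example.com>, relay=none',
    'Apr  8 14:22:05 host postfix/smtp[3]: CCC333: to=<rcpt@example.com>, relay=none',
);

my @extracted = map  extract_data($_),
                grep msg_rcpt_to( $rcpt, $_ ), @log_data;

print "@extracted\n";    # AAA111 CCC333
```

Neither map nor grep modifies @log_data, so nothing needs to be deleted along the way.
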

daotoad
+5  A: 

Do it in a single pass:

#! /usr/bin/perl

use warnings;
use strict;

# for demo only
*ARGV = *DATA;

my %msg;
while (<>) {
  if (s!^.*postfix/\w+\[.+?\]: (\w+):\s*!!) {
    my $key = $1;
    push @{ $msg{$key}{$1} } => $2
      while /\b(to|from|client)=(.+?)(?:,|$)/g;
  }
}

use Data::Dumper;
$Data::Dumper::Indent = 1;
print Dumper \%msg;
__DATA__
Apr  8 14:22:02 MailSecure03 postfix/smtpd[32388]: BA1CE38965: client=mail.example.com[x.x.x.x]
Apr  8 14:22:03 MailSecure03 postfix/cleanup[32070]: BA1CE38965: message-id=<[email protected]>
Apr  8 14:22:03 MailSecure03 postfix/qmgr[19685]: BA1CE38965: from=<[email protected]>, size=1087, nrcpt=2 (queue active)
Apr  8 14:22:04 MailSecure03 postfix/smtp[32608]: BA1CE38965: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10025, delay=1.7, delays=1/0/0/0.68, dsn=2.0.0, status=sent (250 OK, sent 49DC509B_360_15637_162D8438973)
Apr  8 14:22:04 MailSecure03 postfix/smtp[32608]: BA1CE38965: to=<[email protected]>, relay=127.0.0.1[127.0.0.1]:10025, delay=1.7, delays=1/0/0/0.68, dsn=2.0.0, status=sent (250 OK, sent 49DC509B_360_15637_162D8438973)
Apr  8 14:22:04 MailSecure03 postfix/qmgr[19685]: BA1CE38965: removed
Apr  8 14:22:04 MailSecure03 postfix/smtpd[32589]: 62D8438973: client=localhost.localdomain[127.0.0.1]
Apr  8 14:22:04 MailSecure03 postfix/cleanup[32080]: 62D8438973: message-id=<[email protected]>
Apr  8 14:22:04 MailSecure03 postfix/qmgr[19685]: 62D8438973: from=<[email protected]>, size=1636, nrcpt=2 (queue active)
Apr  8 14:22:04 MailSecure03 postfix/smtp[32417]: 62D8438973: to=<[email protected]>, relay=y.y.y.y[y.y.y.y]:25, delay=0.19, delays=0.04/0/0.04/0.1, dsn=2.6.0, status=sent (250 2.6.0  <[email protected]> Queued mail for delivery)
Apr  8 14:22:04 MailSecure03 postfix/smtp[32417]: 62D8438973: to=<[email protected]>, relay=y.y.y.y[y.y.y.y]:25, delay=0.19, delays=0.04/0/0.04/0.1, dsn=2.6.0, status=sent (250 2.6.0  <[email protected]> Queued mail for delivery)
Apr  8 14:22:04 MailSecure03 postfix/qmgr[19685]: 62D8438973: removed

The code works by first looking for a queue ID (e.g., BA1CE38965 and 62D8438973 above), which we store in $key.

Next, we find all matches on the current line (thanks to the /g switch) that look like to=<...>, client=mail.example.com, and so on—with and without the separating comma.

Of note in the pattern are

  • \b - matches on a word boundary only (prevents matching xxxto=<...>)
  • (to|from|client) - match to or from or client
  • (.+?) - matches the field's value with a non-greedy quantifier
  • (?:,|$) - matches either a comma or the end of the string, without capturing into $3

The non-greedy (.+?) forces the match to stop at the first comma it encounters rather than the last. Otherwise, on a line with

to=<[email protected]>, other=123

you'd get <[email protected]>, other=123 as the recipient!
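
To see the difference for yourself, here is a small comparison. The address is a placeholder, and the greedy variant is shown only for contrast:

```perl
use strict;
use warnings;

my $line = 'to=<rcpt@example.com>, other=123';

# Non-greedy: the value ends at the first comma.
my ($lazy) = $line =~ /\bto=(.+?)(?:,|$)/;

# Greedy: .+ grabs as much as it can, so the value runs to the end of the string.
my ($greedy) = $line =~ /\bto=(.+)(?:,|$)/;

print "non-greedy: $lazy\n";      # <rcpt@example.com>
print "greedy:     $greedy\n";    # <rcpt@example.com>, other=123
```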

Then for each field matched, we push it onto the end of an array (because there may be multiple recipients, for example) connected to both the queue ID and field name. Take a look at the result:

$VAR1 = {
  '62D8438973' => {
    'client' => [
      'localhost.localdomain[127.0.0.1]'
    ],
    'to' => [
      '<[email protected]>',
      '<[email protected]>'
    ],
    'from' => [
      '<[email protected]>'
    ]
  },
  'BA1CE38965' => {
    'client' => [
      'mail.example.com[x.x.x.x]'
    ],
    'to' => [
      '<[email protected]>',
      '<[email protected]>'
    ],
    'from' => [
      '<[email protected]>'
    ]
  }
};

Now say you want to print all the recipients of the message whose queue ID is BA1CE38965:

my $queueid = "BA1CE38965";
foreach my $recip (@{ $msg{$queueid}{to} }) {
  print $recip, "\n";
}

Maybe you want to know only how many recipients:

print scalar @{ $msg{$queueid}{to} }, "\n";

If you're willing to assume each message has exactly one client, access it with

print $msg{$queueid}{client}[0], "\n";
Greg Bacon
This is fantastic, thank you... I was focused on pulling out only the messages I'm interested in (ones that match [0-9-]@ACertainDomain.com) and didn't think about just loading up all the pertinent info from the file into a hash and then pulling messages out of that. I'm going to use your code as a foundation and see if I can't build up from there. I'm sure I'll have more questions (I'm still trying to parse that 'while' regexp, I'm so rusty at this).
Justin
@Justin You're welcome! See updated explanation.
Greg Bacon
Thanks again. My parse now takes about 3 minutes per file as opposed to 3 hours. This community is awesome.
Justin
@Justin Tell your friends!
Greg Bacon