views:

222

answers:

5

I am trying to sum the traffic of different ports in the log files from "IPCop", so I wrote a command for my shell, but I think it is possible to optimize the command.

First, a line from my log file:

01/00:03:16 kernel INPUT IN=eth1 OUT= MAC=xxx SRC=xxx DST=xxx LEN=40 TOS=0x00 PREC=0x00 TTL=98 ID=256 PROTO=TCP SPT=47438 DPT=1433 WINDOW=16384 RES=0x00 SYN URGP=0 

Now, with the following command, I sum all the lengths from lines that contain port 1433:

grep 1433 log.dat|awk '{for(i=1;i<=10;i++)if($i ~ /LEN/)print $i};'|sed 's/LEN=//g;'|awk '{sum+=$1}END{print sum}'

I need the for loop because the LEN column is not always in the same position.
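For reference, this is the pipeline from the question run against the sample log line above (piped in via printf instead of a real log.dat, so the sketch is self-contained):

```shell
# The original grep | awk | sed | awk pipeline, fed the sample line;
# on a real system you would give grep the log file instead.
printf '%s\n' '01/00:03:16 kernel INPUT IN=eth1 OUT= MAC=xxx SRC=xxx DST=xxx LEN=40 TOS=0x00 PREC=0x00 TTL=98 ID=256 PROTO=TCP SPT=47438 DPT=1433 WINDOW=16384 RES=0x00 SYN URGP=0' |
grep 1433 |
awk '{for(i=1;i<=10;i++)if($i ~ /LEN/)print $i};' |   # print the LEN=num field
sed 's/LEN=//g;' |                                    # strip the LEN= prefix
awk '{sum+=$1}END{print sum}'                         # sum the remaining numbers
# prints: 40
```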

Any suggestions for optimizing this command?

Regards Rene

+1  A: 

If you are using gawk, you can use \< to avoid the need for the for loop, the match() function to find the substring "\<LEN=.*\>" (i.e., projecting out the field you want), and substr to project out the argument of LEN. You can then use a single awk invocation to do everything.

Postscript

The regexp I gave above doesn't work, because the = character is not part of a word. The following awk script does work:

/1433/ { f=match($0,/ LEN=[[:digit:]]+ /); v=substr($0,RSTART+5,RLENGTH-6); s+=v; }
END    { print "sum=" s; }
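The script above can be exercised on the sample line from the question like this (the printf stands in for a real log file; with one you would run awk '...' log.dat instead):

```shell
# One-pass awk: match() sets RSTART/RLENGTH for " LEN=40 ",
# then substr() cuts out just the number between "LEN=" and the trailing space.
printf '%s\n' '01/00:03:16 kernel INPUT IN=eth1 OUT= MAC=xxx SRC=xxx DST=xxx LEN=40 TOS=0x00 PREC=0x00 TTL=98 ID=256 PROTO=TCP SPT=47438 DPT=1433 WINDOW=16384 RES=0x00 SYN URGP=0' |
awk '/1433/ { f=match($0,/ LEN=[[:digit:]]+ /); v=substr($0,RSTART+5,RLENGTH-6); s+=v; }
     END    { print "sum=" s; }'
# prints: sum=40
```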
Charles Stewart
Sorry, I did not understand the gawk syntax; can you explain it, please? I tried grep 'PT=1433' |gawk '"\<LEN=.*\>"' as the expression to find the string, but how can I sum the fields in the gawk command?
kockiren
In the documentation I found: \< matches the empty string at the beginning of a word. For example, /\<away/ matches 'away' but not 'stowaway'. But how can I use this for my case?
kockiren
Hey Charles, I looked for a solution along the lines you suggested, but I still haven't found a way to use \< to avoid the for loop. Is there any way to search all columns $1 ... $NF?
kockiren
Hey Charles, thanks for your solution, but it is much slower than my first shell command. This command needs 4.1 s and the first solution needs only 0.28 s, but when I split the command into grep 'PT=1433' mai_kernel_log.dat| awk '{f=match($0,/ LEN=[[:digit:]]+ /); v=substr($0,RSTART+5,RLENGTH-6); s+=v;}END{print s;}' it needs 0.367 s.
kockiren
@kockiren: How are you timing the scripts?
Charles Stewart
I run the program with the Debian time command, e.g. time awk {...}. I measured with 3 log files (1st 65 MB, 2nd 650 MB and 3rd 6.5 GB), and the times above are from the 1st log file.
kockiren
For each log file I run the program 5 times and take the average time.
kockiren
A: 

If these are all on single lines, you can use Perl to extract the LEN values and sum them.

perl -e '$f = 0; while (<>) {/.*LEN=([0-9]+).*/ ; $f += $1;} print "$f\n";' input.log

I apologise for the bad Perl. I'm not a Perl guy at all.

Noufal Ibrahim
I changed your script to perl -e '$f = 0; while (<>) {if(/PT=1433/){/LEN=([0-9]+)/ ; $f += $1;}} print "$f\n";' log.dat and now I get the right result. With time I measure a difference of 0.08 seconds.
kockiren
So I ran a test with the Perl and shell commands: if the I/O performance is fast enough, I see that Perl's compile time makes it slower than the runtime of the shell command. If the log file is 7 GB, the Perl command catches up to the shell command. So I think it's better to optimize the shell command.
kockiren
Are you saying that for smaller files the Perl command is slower, and you see gains only when the file reaches close to 7 GB? I'm quite surprised by that result, since the two-command pipeline should iterate through the file at least twice.
Noufal Ibrahim
I tried both commands (with a 65 MB file) on the same machine with a fast SCSI device; the shell command needs 0.0287 s and the Perl command needs 0.822 s. The same test on a normal PC is better for Perl. I think it is because, with the fast I/O device on the server machine, Perl's compile time dominates, while awk, sed and grep are already compiled, so there you only see the real runtime of the script.
kockiren
There is one 0 too many in the runtime for the shell script; the shell command needs 0.287 s :-)
kockiren
+2  A: 

If it really needs optimization, as in it runs unbearably slowly, you should probably rewrite it in a more general-purpose language. Even AWK would do, but I'd suggest something closer to Perl or Java for a long-running extractor.

One change you could make: rather than using an unnecessary sed and a second awk call, move the END into the first awk call, and use split() to extract the number from LEN=num and add it to the accumulator. Something like split($i, x, "="); sum += x[2].

The main problem is you can't write awk '/LEN=(...)/ { sum += var matching the ... }'.
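A sketch of that suggestion as a single awk call, with the port filter folded in and the sample line from the question piped in place of a real log file:

```shell
# One awk call replaces grep + awk + sed + awk: scan the fields for
# LEN=num (its position varies), split on "=", and accumulate the number.
printf '%s\n' '01/00:03:16 kernel INPUT IN=eth1 OUT= MAC=xxx SRC=xxx DST=xxx LEN=40 TOS=0x00 PREC=0x00 TTL=98 ID=256 PROTO=TCP SPT=47438 DPT=1433 WINDOW=16384 RES=0x00 SYN URGP=0' |
awk '/PT=1433/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^LEN=/) {
            split($i, x, "=")   # x[1] = "LEN", x[2] = the byte count
            sum += x[2]
            break               # at most one LEN field per line
        }
} END { print sum }'
# prints: 40
```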

TerryP
Thanks for your post, this is what I want. I was looking for a way to get rid of the 2nd awk and the sed command, but I didn't know how.
kockiren
Most things like that can be found in the AWK manual. I actually learned it because Perl "copied" the idea, and I know that language like the back of my hand, hehe.
TerryP
By now I have changed my shell command to the following: grep 'PT=1433' log.dat|awk '{for(i=1;i<=10;i++)if($i ~ /LEN=/)sum+=sub(/LEN=/,"",$i)}END{print sum};' But the problem is that sub(/LEN=/,"",$i) returns 1, not the value next to LEN=. Any suggestions?
kockiren
sub() performs a substitution; it works kind of like sed s///. Using split() and an array index is probably best, unless performance is such an overhead that it's more effective to manipulate substrings by index (to avoid the cost of regexps). If performance is that critical, IMHO a custom C or ASM app is probably best.
TerryP
I have a better solution with a runtime of 0.199 s: grep 'PT=1433' log.dat|awk '{for(i=1;i<=10;i++){if(sub(/LEN=/,"",$i))sum+=$i}}END{print sum;}'
kockiren
+4  A: 

Since I don't have the rep to add a comment to Noufal Ibrahim's answer, here is a more natural solution using Perl.

perl -ne '$sum += $1 if /LEN=(\d+)/; END { print $sum; }' log.dat

@Noufal: you can make Perl do all the hard work ;).

TerryP
I don't understand this command. It sums the variable $1, but that variable is never defined?! And there is no filter for PT=1433, either. Can you explain your syntax, please?
kockiren
Yeah, I know. It gets unreadable so fast that I usually avoid it. Nice snippet though. Thanks. +1'd you so that you can post comments soon. :)
Noufal Ibrahim
I like unreadable stuff :-) But I don't understand your snippet :-( Does the -n switch stand for the while loop? And how does $1 get the value next to LEN=?
kockiren
$1 is the result of the first capture of a regex; in this case it will be the number next to LEN=.
Daenyth
The -n switch causes Perl to wrap your code (-e 'code') inside a while loop that consumes the lines of input, assigning each one to the $_ variable. Using -p instead of -n would make it print the line as well. If you need multiple regexes on the line, you can extend the if statement; I used a direct grab on LEN because I thought it was the main point of interest. Perl is excellent for such tasks but can take a bit of learning to use effectively. P.S. Thanks, Noufal :-)
TerryP
Hey Terry and Daenyth, thanks for your explanation. I tested your snippet and it is slower than the other Perl snippet. I changed the snippet above to perl -ne '$sum += $1 if /PT=1433/ END {print $sum;}' log.dat and the runtime is 0.854 s (other Perl snippet 0.822 s, shell command 0.287 s).
kockiren
+1  A: 

Any time you have grep/sed/awk combinations in a pipeline, you can simplify them into a single awk or perl command. Here's an awk solution:

gawk -v dpt=1433 '
    $0 ~ dpt {
        for (i=1; i<=NF; i++) {
            if ($i ~ /^LEN=[[:digit:]]+/) {
                split($i, ary, /=/)
                sum += ary[2]
                next
            }
        } 
    } 
    END {print sum}
' log.dat
glenn jackman
Hey Glenn, thanks for your snippet, but it's too slow. A test with my 65 MB log file needs 4.122 s, and the snippet does not scale to larger files with better performance.
kockiren