views:

222

answers:

5

I am trying to sum the traffic of different ports in the log files from "IPCop", so I wrote a command for my shell, but I think it is possible to optimize the command.

First, a line from my log file:

01/00:03:16 kernel INPUT IN=eth1 OUT= MAC=xxx SRC=xxx DST=xxx LEN=40 TOS=0x00 PREC=0x00 TTL=98 ID=256 PROTO=TCP SPT=47438 DPT=1433 WINDOW=16384 RES=0x00 SYN URGP=0 

Now, with the following command, I sum all the lengths from lines that contain port 1433:

grep 1433 log.dat|awk '{for(i=1;i<=10;i++)if($i ~ /LEN/)print $i};'|sed 's/LEN=//g;'|awk '{sum+=$1}END{print sum}'

I need the for loop because the LEN column is not always in the same position.
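For reference, this is the pipeline from the question run against the sample log line above (piped in via printf instead of a real log.dat, so the sketch is self-contained):

```shell
# The original grep | awk | sed | awk pipeline, fed the sample line;
# on a real system you would give grep the log file instead.
printf '%s\n' '01/00:03:16 kernel INPUT IN=eth1 OUT= MAC=xxx SRC=xxx DST=xxx LEN=40 TOS=0x00 PREC=0x00 TTL=98 ID=256 PROTO=TCP SPT=47438 DPT=1433 WINDOW=16384 RES=0x00 SYN URGP=0' |
grep 1433 |
awk '{for(i=1;i<=10;i++)if($i ~ /LEN/)print $i};' |   # print the LEN=num field
sed 's/LEN=//g;' |                                    # strip the LEN= prefix
awk '{sum+=$1}END{print sum}'                         # sum the remaining numbers
# prints: 40
```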

Any suggestions for optimizing this command?

Regards Rene

+1  A: 

If you are using gawk, you can use \< to avoid the need for the for loop, the match() function to find the substring "\<LEN=.*\>" (i.e., projecting out the field you want), and substr to project out the argument of LEN. You can then use a single awk invocation to do everything.

Postscript

The regexp I gave above doesn't work, because the = character is not part of a word. The following awk script does work:

/1433/ { f=match($0,/ LEN=[[:digit:]]+ /); v=substr($0,RSTART+5,RLENGTH-6); s+=v; }
END    { print "sum=" s; }
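The script above can be exercised on the sample line from the question like this (the printf stands in for a real log file; with one you would run awk '...' log.dat instead):

```shell
# One-pass awk: match() sets RSTART/RLENGTH for " LEN=40 ",
# then substr() cuts out just the number between "LEN=" and the trailing space.
printf '%s\n' '01/00:03:16 kernel INPUT IN=eth1 OUT= MAC=xxx SRC=xxx DST=xxx LEN=40 TOS=0x00 PREC=0x00 TTL=98 ID=256 PROTO=TCP SPT=47438 DPT=1433 WINDOW=16384 RES=0x00 SYN URGP=0' |
awk '/1433/ { f=match($0,/ LEN=[[:digit:]]+ /); v=substr($0,RSTART+5,RLENGTH-6); s+=v; }
     END    { print "sum=" s; }'
# prints: sum=40
```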
Charles Stewart
Sorry, I did not understand the gawk syntax; can you explain it, please? I tried grep 'PT=1433' |gawk '"\<LEN=.*\>"' as the expression to find the string, but how can I sum the fields in the gawk command?
kockiren
In the documentation I found: \< matches the empty string at the beginning of a word. For example, /\<away/ matches 'away' but not 'stowaway'. But how can I use this for my case?
kockiren
Hey Charles, I looked for a solution along the lines you suggested, but I still haven't found a way to use \< to avoid the for loop. Is there any way to search all columns $1 ... $NF?
kockiren
Hey Charles, thanks for your solution, but it is much slower than my first shell command. This command needs 4.1 s and the first solution needs only 0.28 s, but when I split the command into grep 'PT=1433' mai_kernel_log.dat| awk '{f=match($0,/ LEN=[[:digit:]]+ /); v=substr($0,RSTART+5,RLENGTH-6); s+=v;}END{print s;}' it needs 0.367 s.
kockiren
@kockiren: How are you timing the scripts?
Charles Stewart
I run the program with the Debian time command, e.g. time awk {...}. I measured with 3 log files (1st 65 MB, 2nd 650 MB and 3rd 6.5 GB), and the times above are from the 1st log file.
kockiren
For each log file I run the program 5 times and take the average time.
kockiren
A: 

If these are all on single lines, you can use Perl to extract the LEN values and sum them.

perl -e '$f = 0; while (<>) {/.*LEN=([0-9]+).*/ ; $f += $1;} print "$f\n";' input.log

I apologise for the bad Perl. I'm not a Perl guy at all.

Noufal Ibrahim
I changed your script to perl -e '$f = 0; while (<>) {if(/PT=1433/){/LEN=([0-9]+)/ ; $f += $1;}} print "$f\n";' log.dat and now I get the right result. With time I measure a difference of 0.08 seconds.
kockiren
So I ran a test with the Perl and shell commands: if the I/O performance is fast enough, I see that Perl's compile time makes it slower than the runtime of the shell command. If the log file is 7 GB, the Perl command catches up to the shell command. So I think it's better to optimize the shell command.
kockiren
Are you saying that for smaller files the Perl command is slower, and you see gains only when the file reaches close to 7 GB? I'm quite surprised by that result, since the two-command pipeline should iterate through the file at least twice.
Noufal Ibrahim
I tried both commands (with a 65 MB file) on the same machine with a fast SCSI device; the shell command needs 0.0287 s and the Perl command needs 0.822 s. The same test on a normal PC is better for Perl. I think it is because, with the fast I/O device on the server machine, Perl's compile time dominates, while awk, sed and grep are already compiled, so there you only see the real runtime of the script.
kockiren
There is one 0 too many in the runtime for the shell script; the shell command needs 0.287 s :-)
kockiren
+2  A: 

If it really needs optimization, as in it runs unbearably slowly, you should probably rewrite it in a more general-purpose language. Even AWK would do, but I'd suggest something closer to Perl or Java for a long-running extractor.

One change you could make: rather than using an unnecessary sed and a second awk call, move the END into the first awk call, and use split() to extract the number from LEN=num and add it to the accumulator. Something like split($i, x, "="); sum += x[2].

The main problem is you can't write awk '/LEN=(...)/ { sum += var matching the ... }'.
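A sketch of that suggestion as a single awk call, with the port filter folded in and the sample line from the question piped in place of a real log file:

```shell
# One awk call replaces grep + awk + sed + awk: scan the fields for
# LEN=num (its position varies), split on "=", and accumulate the number.
printf '%s\n' '01/00:03:16 kernel INPUT IN=eth1 OUT= MAC=xxx SRC=xxx DST=xxx LEN=40 TOS=0x00 PREC=0x00 TTL=98 ID=256 PROTO=TCP SPT=47438 DPT=1433 WINDOW=16384 RES=0x00 SYN URGP=0' |
awk '/PT=1433/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^LEN=/) {
            split($i, x, "=")   # x[1] = "LEN", x[2] = the byte count
            sum += x[2]
            break               # at most one LEN field per line
        }
} END { print sum }'
# prints: 40
```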

TerryP
Thanks for your post, this is what I want. I was looking for a way to get rid of the 2nd awk and the sed command, but I didn't know how.
kockiren
Most things like that can be found in the AWK manual. I actually learned it because Perl "copied" the idea, and I know that language like the back of my hand, hehe.
TerryP
By now I have changed my shell command to the following: grep 'PT=1433' log.dat|awk '{for(i=1;i<=10;i++)if($i ~ /LEN=/)sum+=sub(/LEN=/,"",$i)}END{print sum};' But the problem is that sub(/LEN=/,"",$i) returns 1, not the value next to LEN=. Any suggestions?
kockiren
sub() performs a substitution; it works kind of like sed s///. Using split() and an array index is probably best, unless performance is such an overhead that it's more effective to manipulate substrings by index (to avoid the cost of regexps). If performance is that critical, IMHO a custom C or ASM app is probably best.
TerryP
I have a better solution with a runtime of 0.199 s: grep 'PT=1433' log.dat|awk '{for(i=1;i<=10;i++){if(sub(/LEN=/,"",$i))sum+=$i}}END{print sum;}'
kockiren
+4  A: 

Since I don't have the rep to add a comment to Noufal Ibrahim's answer, here is a more natural solution using Perl.

perl -ne '$sum += $1 if /LEN=(\d+)/; END { print $sum; }' log.dat

@Noufal: you can make Perl do all the hard work ;).

TerryP
I don't understand this command. It sums the variable $1, but that variable is never defined?! And there is no filter for PT=1433, either. Can you explain your syntax, please?
kockiren
Yeah, I know. It gets unreadable so fast that I usually avoid it. Nice snippet though. Thanks. +1'd you so that you can post comments soon. :)
Noufal Ibrahim
I like unreadable stuff :-) But I don't understand your snippet :-( Does the -n switch stand for the while loop? And how does $1 get the value next to LEN=?
kockiren
$1 is the result of the first capture of a regex; in this case it will be the number next to LEN=.
Daenyth
The -n switch causes Perl to wrap your code (-e 'code') inside a while loop that consumes the lines of input, assigning each one to the $_ variable. Using -p instead of -n would make it print the line as well. If you need multiple regexes on the line, you can extend the if statement; I used a direct grab on LEN because I thought it was the main point of interest. Perl is excellent for such tasks but can take a bit of learning to use effectively. P.S. Thanks, Noufal :-)
TerryP
Hey Terry and Daenyth, thanks for your explanation. I tested your snippet and it is slower than the other Perl snippet. I changed the snippet above to perl -ne '$sum += $1 if /PT=1433/ END {print $sum;}' log.dat and the runtime is 0.854 s (other Perl snippet 0.822 s, shell command 0.287 s).
kockiren
+1  A: 

Any time you have grep/sed/awk combinations in a pipeline, you can simplify them into a single awk or perl command. Here's an awk solution:

gawk -v dpt=1433 '
    $0 ~ dpt {
        for (i=1; i<=NF; i++) {
            if ($i ~ /^LEN=[[:digit:]]+/) {
                split($i, ary, /=/)
                sum += ary[2]
                next
            }
        } 
    } 
    END {print sum}
' log.dat
glenn jackman
Hey Glenn, thanks for your snippet, but it's too slow. A test with my 65 MB log file needs 4.122 s, and the snippet does not scale to larger files with better performance.
kockiren