ansaurus

Question

Answer 1

+1 A:

The quickest thing given what you've already done would be to simply name the files "Sep 4" and so on, then rename them all at the end - that way all you have to do is read a certain number of characters, no extra processing.
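
Something along these lines is what I have in mind. It's only a sketch: it assumes lines look like "Sep 4 11:45 kernel: Entry", that everything is from 2009, and that GNU date is available. The names dates.log and file.* are placeholders. It strips everything from the time onward rather than counting characters, since the day isn't padded in this sample.

```shell
#!/bin/bash
# Sketch only: dates.log, the "file." prefix and the year 2009 are assumptions.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'Sep 4 11:45 kernel: Entry' \
              'Sep 4 12:00 kernel: Another' \
              'Sep 5 08:44 syslog: Entry' > dates.log

# Pass 1: no date call at all; each file is named after the raw "Mon D" string.
while IFS= read -r line; do
    raw=${line%% [0-9][0-9]:*}           # text before the time, e.g. "Sep 4"
    printf '%s\n' "$line" >> "file.$raw"
done < dates.log

# Pass 2: one date call per output file instead of one per input line.
for f in file.*; do
    raw=${f#file.}                       # back to "Sep 4"
    mv -- "$f" "file.$(date -d "$raw 2009" +%Y%m%d)"
done
ls
```

The point is that date now runs once per day found in the log, not once per line.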

If for some reason you don't want to do that, but you know the dates are in order, you could cache the previous date in both forms, and do a string comparison to find out whether you need to run date again or just use the old cached date.
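
A minimal sketch of the caching idea, under the same assumptions (entries already in date order, year 2009, GNU date). The variable names and sample lines are made up for illustration.

```shell
#!/bin/bash
# Sketch only: the sample lines and the year 2009 are assumptions.
prev_raw=""      # last "Mon D" string seen
prev_iso=""      # its converted form, e.g. 20090904
date_calls=0     # how many times date was actually started
out=""           # converted date chosen for each line, space separated

while IFS= read -r line; do
    raw=${line%% [0-9][0-9]:*}           # text before the time, e.g. "Sep 4"
    if [ "$raw" != "$prev_raw" ]; then   # new day: convert once and remember it
        prev_iso=$(date -d "$raw 2009" +%Y%m%d)
        prev_raw=$raw
        date_calls=$((date_calls + 1))
    fi
    out="$out$prev_iso "                 # otherwise reuse the cached value
done <<'EOF'
Sep 4 11:45 kernel: Entry
Sep 4 11:50 kernel: Entry
Sep 4 12:10 syslog: Entry
Sep 5 08:44 syslog: Entry
EOF

echo "targets: $out"
echo "date ran $date_calls times for 4 lines"
```

The string comparison is done by the shell itself, so lines that repeat the previous date cost no extra process.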

Finally, if speed really keeps being an issue, you could try Perl or Python instead of bash. You're not doing anything too crazy here, though (besides starting a subshell and a date process for every line, which we already figured out how to avoid), so I don't know how much it'll help.

Jefromi 2009-10-28 18:51:19

Ah, good thinking.

bradlis7 2009-10-28 18:57:08

It still takes over 15 seconds to process 70,000 lines, but grep takes 0.072 user seconds to do it all. I don't know how grep is so fast. Maybe it is because it is compiled.

bradlis7 2009-10-28 19:21:00

grep isn't writing to a file. You're opening and closing the file for writing 70000 times. dimba's answer helps address this side of the problem, though I think at that point I'd switch languages. It just seems so much easier to write open and close than to muck about with file descriptors.

Jefromi 2009-10-28 19:26:53
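
To make the open/close point concrete, here is a small sketch with a single output file. The file names and the line count are made up. Prefix each loop with time if you want to compare them on your own machine.

```shell
#!/bin/bash
# Sketch only: the file names and the line count are made up for illustration.
tmp=$(mktemp -d) || exit 1

# Variant 1: the redirection sits on the command, so the file is opened and
# closed again on every single iteration.
i=0
while [ "$i" -lt 1000 ]; do
    echo "line $i" >> "$tmp/per_line.txt"
    i=$((i + 1))
done

# Variant 2: the redirection sits on the loop, so the file is opened once and
# every echo writes to the already-open descriptor.
i=0
while [ "$i" -lt 1000 ]; do
    echo "line $i"
    i=$((i + 1))
done > "$tmp/once.txt"

cmp "$tmp/per_line.txt" "$tmp/once.txt" && echo "same contents"
```

Redirecting the whole loop only works when everything goes to one file. When the target changes partway through, as it does here with one file per date, you need an explicitly managed descriptor (exec 5>file ... exec 5>&-) instead.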

Ah, that makes a lot of sense. This has never been mentioned in the BASH tutorials that I've seen, or I may have just skipped over it...

bradlis7 2009-10-28 19:51:42

It's not really something that anyone would think to mention. Since you don't have to manually open the file for writing, it's clearly doing it for you, and since you didn't open it, the close has to be automatic too, so it necessarily happens once the write is complete. The gawk answer is going to be a lot faster, though. Please upvote the helpful answers and accept the one you use or find most helpful!

Jefromi 2009-10-29 05:15:10

Ah, true. I can't upvote, as I only have 11 points and need 15. I did accept one, though.

bradlis7 2009-10-29 14:31:04

Answer 2

+1 A:

A skeleton of the script:

BIG_FILE=big.txt

# remove $BIG_FILE when the script exits
trap 'rm -f "$BIG_FILE"' EXIT

# $FILES is left unquoted on purpose so it can expand to several file names
cat $FILES > "$BIG_FILE" || { echo "cat failed"; exit 1; }

# sort file by date in place
sort -M "$BIG_FILE" -o "$BIG_FILE" || { echo "sort failed"; exit 1; }

while IFS= read -r line; do
   # extract date part from line (adjust the length to your log format)
   DATE_STR=${line:0:12}

   # a new date - create a new file (string comparison, not arithmetic)
   if [[ "$DATE_STR" != "$PREV_DATE_STR" ]]; then
       # close file descriptor of "dated" file
       exec 5>&-
       PREV_DATE_STR=$DATE_STR

       # open file of a "dated" file for write
       FILE_NAME= ... set to file name ...
       exec 5>"$FILE_NAME" || { echo "exec failed"; exit 1; }
   fi

   # printf writes the line unchanged; bash's echo would print "--" literally
   printf '%s\n' "$line" >&5 || { echo "print failed"; exit 1; }
done < "$BIG_FILE"

dimba 2009-10-28 19:22:53

bradlis7 2009-10-28 19:50:10

dimba 2009-10-28 19:56:57

I see. Bash is usually pretty easy to understand, but these weird things like this throw me for a loop at times. I'll try to implement and see what kind of timing I get on this. Thanks.

bradlis7 2009-10-28 20:00:06

This is somewhat more advanced stuff, but it's not too complicated for what you need. Very cool things can be achieved with it. Waiting to hear the results :)

dimba 2009-10-28 20:02:42

Hey, the "return" needs to be changed to "exit", as it's not in a function, and print is complaining that -u is not a valid argument. Is that a typo? I could just use `echo $line > 5` right?

bradlis7 2009-10-28 20:20:41
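
On the `> 5` versus `>&5` question, a small sketch of the difference as bash treats it. The directory and file names are made up.

```shell
#!/bin/bash
# Sketch only: the directory and file names are made up for illustration.
dir=$(mktemp -d) && cd "$dir" || exit 1

exec 5> dated.txt        # open file descriptor 5 for writing

line="Sep 4 11:45 kernel: Entry"
echo "$line" >&5         # writes to whatever descriptor 5 points at
echo "$line" > 5         # creates (or truncates) a regular file named "5"

exec 5>&-                # close descriptor 5

ls                       # lists both "5" and "dated.txt"
```

So without the ampersand the line ends up in a file literally called 5, and that file is reopened on every write, which brings back the original slowdown.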

dimba 2009-10-28 20:29:33

bradlis7 2009-10-28 20:30:14

How do you extract date part from string?

dimba 2009-10-28 20:48:20

Try "DATE_STR=${line:0:12}" - take 1st 12 chars

dimba 2009-10-28 20:51:52

No, I meant 12 second running time.

bradlis7 2009-10-28 21:02:38

I know. We need to remove all external (non bash) commands from the while loop. "DATE_STR=${line:0:12}" will extract date from $line by using bash only

dimba 2009-10-28 21:15:16
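
For comparison, a sketch of the same extraction done with an external command and with bash alone. The sample line is made up, and the offset and length would need adjusting to the real log format.

```shell
#!/bin/bash
# Sketch only: the sample line is made up; adjust offset and length to the log.
line="Sep 4 11:45 kernel: Entry"

# External command: starts a subshell and a cut process for each line.
with_cut=$(printf '%s\n' "$line" | cut -c1-12)

# Bash substring expansion: same 12 characters, no extra process.
with_bash=${line:0:12}

# Brackets make any trailing space visible.
printf '[%s]\n[%s]\n' "$with_cut" "$with_bash"
```

Note that for a line like this one the first 12 characters also cover the time, so a shorter length may be what you actually want for a per-day key.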

Oh, well I'm using Jefromi's method, extracting just a string, and then going back and renaming using `date`, so it wasn't part of what I timed.

bradlis7 2009-10-29 14:16:16

Answer 3

A:

This script executes the inner loop 365 or 366 times, once for each day of the year, instead of iterating over each line of the log file:

#!/bin/bash
month=0
months=(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)
for eom in 31 29 31 30 31 30 31 31 30 31 30 31
do
    (( month++ ))
    echo "Month $month"
    if (( month == 2 ))    # see what day February ends on
    then
        eom=$(date -d "3/1 - 1 day" +%-d)
    fi
    for (( day=1; day<=eom; day++ ))
    do
        grep "^${months[$month - 1]} $day " dates.log > temp.out
        if [[ -s temp.out ]]
        then
            mv temp.out file.$(date -d $month/$day +"%Y%m%d")
        else
            rm temp.out
        fi
        # instead of creating a temp file and renaming or removing it,
        # you could go ahead and let grep create empty files and let find
        # delete them at the end, so instead of the grep and if/then/else
        # immediately above, do this:
        # grep --color=never "^${months[$month - 1]} $day " dates.log > file.$(date -d $month/$day +"%Y%m%d")
    done
done
# if you let grep create empty files, then do this:
# find -type f -name "file.2009*" -empty -delete

Dennis Williamson 2009-10-28 22:10:42

That's pretty creative with the end of months, but not the way I was hoping for. It would seem to me that grep would be inefficient, even though the results seem to be faster.

bradlis7 2009-10-29 14:29:42

Answer 4

+2 A:

@OP try not to use bash's while read loop to iterate over a big file. It's tried and proven that it's slow, and furthermore, you are calling the external date command for every line of the file you read. Here's a more efficient way, using only gawk:

gawk 'BEGIN{
 m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",mth,"|")     
}
{ 
 for(i=1;i<=m;i++){ if ( mth[i]==$1){ month = i } }
 tt="2009 "month" "$2" 00 00 00" 
 date= strftime("%Y%m%d",mktime(tt))
 print $0 > (FILENAME "." date)
}
' logfile

output

$ more logfile
Sep 4 11:45 kernel: Entry
Sep 5 08:44 syslog: Entry

$ ./shell.sh

$ ls -1 logfile.*
logfile.20090904
logfile.20090905

$ more logfile.20090904
Sep 4 11:45 kernel: Entry

$ more logfile.20090905
Sep 5 08:44 syslog: Entry

ghostdog74 2009-10-29 02:12:06

Nice work... I thought about sed, but I didn't see a way to do what I was looking for. This looks great.

bradlis7 2009-10-29 14:28:47

2.8 seconds! That will work for me.

bradlis7 2009-10-29 14:37:59

+1 Awk to the rescue! Very nice!

Dennis Williamson 2009-10-29 17:00:09


Arrange Log Entries into Dated Files
