views: 71
answers: 4

I'm trying to split a large log file, containing log entries for several months, into separate log files by date. There are thousands of lines, as follows:

Sep 4 11:45 kernel: Entry
Sep 5 08:44 syslog: Entry

I'm trying to split it up so that logfile.20090904 and logfile.20090905 each contain the matching entries.

I've written a script to read each line and send it to the appropriate file, but it runs pretty slowly (especially since I have to turn a month name into a number). I've thought about doing a grep for each day, which would require finding the first date in the file, but that seems slow as well.

Is there a more optimal solution? Maybe I'm missing a command line program that would work better.

Here is my current solution:

#!/bin/bash
while IFS= read -r line; do
  dts="${line:0:6}"
  dt=$(date -d "$dts" +'%Y%m%d')
  # Note that I could do some caching of the date here, assuming
  # that dates are grouped together.
  echo "$line" >> "$FILE.$dt" 2> /dev/null
done < "$FILE"
+1  A: 

The quickest thing given what you've already done would be to simply name the files "Sep 4" and so on, then rename them all at the end - that way all you have to do is read a certain number of characters, no extra processing.

If for some reason you don't want to do that, but you know the dates are in order, you could cache the previous date in both forms, and do a string comparison to find out whether you need to run date again or just use the old cached date.

Finally, if speed really keeps being an issue, you could try Perl or Python instead of bash. You're not doing anything too crazy here, though (besides starting a subshell and a date process for every line, which we already figured out how to avoid), so I don't know how much it would help.
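A minimal sketch of the caching idea, assuming consecutive lines usually share a date (the sample input and the `logfile` name are just for illustration):

```shell
#!/bin/bash
# Illustrative sample input; substitute the real log path.
printf 'Sep 4 11:45 kernel: Entry\nSep 5 08:44 syslog: Entry\n' > logfile
rm -f logfile.[0-9]*

prev_dts="" dt=""
while IFS= read -r line; do
  dts="${line:0:6}"                    # e.g. "Sep 4 "
  if [[ "$dts" != "$prev_dts" ]]; then
    dt=$(date -d "$dts" +%Y%m%d)       # fork date only when the date changes
    prev_dts=$dts
  fi
  printf '%s\n' "$line" >> "logfile.$dt"
done < logfile
```

With grouped dates this runs `date` once per day instead of once per line.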

Jefromi
Ah, good thinking.
bradlis7
It still takes over 15 seconds to process 70,000 lines, but grep takes 0.072 user seconds to do it all. I don't know how grep is so fast; maybe it's because it's compiled.
bradlis7
grep isn't writing to a file. You're opening and closing the output file for writing 70,000 times. idimba's answer helps address that side of the problem, though I think at that point I'd switch languages. It just seems so much easier to write open and close than to muck about with file descriptors.
Jefromi
Ah, that makes a lot of sense. This has never been mentioned in the Bash tutorials I've seen, or I may have just skipped over it...
bradlis7
It's not really something anyone would think to mention. Since you don't have to manually open the file for writing, bash is clearly doing it for you, and since you didn't open it, the close has to be automatic too, so it necessarily happens once the write is complete. The gawk answer above is going to be a lot faster, though. Please upvote the helpful answers and accept the one you use / find most helpful!
Jefromi
Ah, true. I can't upvote, as I only have 11 points and need 15. I did accept one, though.
bradlis7
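The open/close cost discussed in the comments above is easy to see in a sketch: the first loop below reopens its output file for every line, while the second redirects the whole loop once (file names here are illustrative only):

```shell
#!/bin/bash
# Compare per-line append vs. a single redirection of the whole loop.
seq 1000 > input.txt
rm -f out1.txt out2.txt

while IFS= read -r line; do
  echo "$line" >> out1.txt      # open + write + close, once per line
done < input.txt

while IFS= read -r line; do
  echo "$line"
done < input.txt > out2.txt     # output file opened a single time
```

Both produce identical output, but the second avoids 1,000 open/close system-call pairs.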
+1  A: 

A skeleton of script:

BIG_FILE=big.txt

# remove $BIG_FILE when the script exits
trap "rm -f $BIG_FILE" EXIT

cat $FILES > $BIG_FILE || { echo "cat failed"; exit 1; }

# sort file by date in place
sort -M $BIG_FILE -o $BIG_FILE || { echo "sort failed"; exit 1; }

while read -r line; do
   # extract date part from line ...
   DATE_STR=${line:0:12}

   # a new date - create a new file
   if [[ "$DATE_STR" != "$PREV_DATE_STR" ]]; then
       # close file descriptor of the previous "dated" file, if any
       [[ -n "$PREV_DATE_STR" ]] && exec 5>&-
       PREV_DATE_STR=$DATE_STR

       # open file of a "dated" file for write
       FILE_NAME= ... set to file name ...
       exec 5>$FILE_NAME || { echo "exec failed"; exit 1; }
   fi

   echo "$line" >&5 || { echo "print failed"; exit 1; }
done < $BIG_FILE
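A runnable version of the file-descriptor technique, with the skeleton's placeholders filled in (the sample input and the `logfile.` naming are assumptions for illustration only):

```shell
#!/bin/bash
# Sample input; in practice this would be the concatenated, sorted log.
printf 'Sep 4 11:45 kernel: Entry\nSep 5 08:44 syslog: Entry\n' > big.txt
rm -f logfile.Sep_*

PREV_DATE_STR=""
while IFS= read -r line; do
    DATE_STR="${line:0:5}"                      # "Sep 4", "Sep 5", ...
    if [[ "$DATE_STR" != "$PREV_DATE_STR" ]]; then
        [[ -n "$PREV_DATE_STR" ]] && exec 5>&-  # close the previous file
        PREV_DATE_STR=$DATE_STR
        exec 5>"logfile.${DATE_STR// /_}"       # one open per date
    fi
    printf '%s\n' "$line" >&5
done < big.txt
exec 5>&-
```

Each output file is opened exactly once, so the loop does no per-line open/close work.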
dimba
I see. Bash is usually pretty easy to understand, but weird things like this throw me for a loop at times. I'll try to implement it and see what kind of timing I get. Thanks.
bradlis7
This is slightly more advanced stuff, but it's not too complicated for what you need. Very cool things can be achieved with it. Waiting to hear the results :)
dimba
Hey, the "return" needs to be changed to "exit", as it's not in a function, and print is complaining that -u is not a valid argument. Is that a typo? I could just use `echo $line > 5` right?
bradlis7
How do you extract date part from string?
dimba
Try "DATE_STR=${line:0:12}" - it takes the first 12 chars.
dimba
No, I meant 12 second running time.
bradlis7
I know. We need to remove all external (non-bash) commands from the while loop. "DATE_STR=${line:0:12}" will extract the date from $line using bash only.
dimba
Oh, well, I'm using Jefromi's method: extracting just a string, then going back and renaming with `date`, so it wasn't part of what I timed.
bradlis7
A: 

This script executes the inner loop 365 or 366 times, once for each day of the year, instead of iterating over each line of the log file:

#!/bin/bash
month=0
months=(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)
for eom in 31 29 31 30 31 30 31 31 30 31 30 31
do
    (( month++ ))
    echo "Month $month"
    if (( month == 2 ))    # see what day February ends on
    then
        eom=$(date -d "3/1 - 1 day" +%-d)
    fi
    for (( day=1; day<=eom; day++ ))
    do
        grep "^${months[$month - 1]} $day " dates.log > temp.out
        if [[ -s temp.out ]]
        then
            mv temp.out file.$(date -d $month/$day +"%Y%m%d")
        else
            rm temp.out
        fi
        # instead of creating a temp file and renaming or removing it,
        # you could go ahead and let grep create empty files and let find
        # delete them at the end, so instead of the grep and if/then/else
        # immediately above, do this:
        # grep --color=never "^${months[$month - 1]} $day " dates.log > file.$(date -d $month/$day +"%Y%m%d")
    done
done
# if you let grep create empty files, then do this:
# find -type f -name "file.2009*" -empty -delete
Dennis Williamson
That's pretty creative with the end-of-month handling, but not the way I was hoping for. It seems to me that grep would be inefficient, even though the results are faster.
bradlis7
+2  A: 

@OP, try not to use bash's while-read loop to iterate over a big file. It's tried and proven to be slow, and furthermore, you are calling the external date command for every line of the file you read. Here's a more efficient way, using only gawk:

gawk 'BEGIN{
 m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",mth,"|")     
}
{ 
 for(i=1;i<=m;i++){ if ( mth[i]==$1){ month = i } }
 tt="2009 "month" "$2" 00 00 00" 
 date= strftime("%Y%m%d",mktime(tt))
 print $0 > (FILENAME "." date)
}
' logfile

output

$ more logfile
Sep 4 11:45 kernel: Entry
Sep 5 08:44 syslog: Entry

$ ./shell.sh

$ ls -1 logfile.*
logfile.20090904
logfile.20090905

$ more logfile.20090904
Sep 4 11:45 kernel: Entry

$ more logfile.20090905
Sep 5 08:44 syslog: Entry
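Note that strftime and mktime are gawk extensions. If only a POSIX awk is available, a sketch of the same idea can build the date by string formatting alone (this hardcodes 2009, like the answer above):

```shell
# Portable-awk sketch: map month names to numbers, then format YYYYMMDD
# directly instead of using gawk's time functions. Sample input is
# illustrative only.
printf 'Sep 4 11:45 kernel: Entry\nSep 5 08:44 syslog: Entry\n' > logfile
awk 'BEGIN{
  m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",mth,"|")
  for(i=1;i<=m;i++) num[mth[i]]=i
}
{
  out = FILENAME "." sprintf("2009%02d%02d", num[$1], $2)
  print > out
}' logfile
```

It produces the same logfile.20090904 / logfile.20090905 split without any gawk-only features.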
ghostdog74
Nice work... I thought about sed, but I didn't see what I was looking for. This looks great.
bradlis7
2.8 seconds! That will work for me.
bradlis7
+1 Awk to the rescue! Very nice!
Dennis Williamson