I have this awk script that runs through a file and counts every occurrence of a given date. The date format in the original file is the standard date format, like this:

Thu Mar 5 16:46:15 EST 2009

I use awk to throw away the weekday, time, and timezone, and then do my counting by pumping the dates into an associative array with the dates as indices.

To get the output sorted by date, I converted the dates to a different format that the sort command can order correctly.

Now, my output looks like this:

Date    Count
03/05/2009   2
03/06/2009   1
05/13/2009   7
05/22/2009  14
05/23/2009   7
05/25/2009   7
05/29/2009  11
06/02/2009  12
06/03/2009  16

I'd really like the output to have more human readable dates, like this:

Mar  5, 2009
Mar  6, 2009
May 13, 2009
May 22, 2009
May 23, 2009
May 25, 2009
May 29, 2009
Jun  2, 2009
Jun  3, 2009

Any suggestions for a way I could do this? If I could do this on the fly when I output the count values that would be best.

UPDATE: Here's my solution incorporating ghostdog74's example code:

grep -i "E[DS]T 2009" original.txt | awk '{printf "%s %2.d, %s\r\n",$2,$3,$6}' >dates.txt #outputs dates for counting
date -f dates.txt +'%Y %m %d' | awk ' #reformat dates as YYYYMMDD for future sort
  {++total[$0]} #pump dates into associative array
  END { 
    for (item in total) printf "%s\t%s\r\n", item, total[item] #output dates as yyyy mm dd with counts
  }' | sort -t \t | awk ' #send to sort, then to cleanup
  BEGIN {printf "%s\t%s\r\n","Date","Count"}
  {t=$1" "$2" "$3" 0 0 0" #cleanup using example by ghostdog74
   printf "%s\t%2.d\r\n",strftime("%b %d, %Y",mktime(t)),$4
  }'
rm dates.txt

Sorry this looks so messy. I've tried to put clarifying comments in.

A: 

Gawk has strftime(). You can also call the date command to format the dates (see its man page). Linux Forums gives some examples.
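As a rough sketch of the date-command route: this assumes GNU date (the -d option is not POSIX), and pretty_date is just a name made up for illustration. It takes one MM/DD/YYYY string, as in the question's current output, and prints it in the "Mar  5, 2009" style. LC_ALL=C is set so the month abbreviation comes out in English.

```shell
# Hypothetical helper, assuming GNU date: reformat one MM/DD/YYYY date.
# %b = abbreviated month name, %e = day of month padded with a space.
pretty_date() {
    LC_ALL=C date -d "$1" +'%b %e, %Y'
}

pretty_date 03/05/2009
```

Starting one date process per line is slow on large files, so this is probably only worth it for short reports.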

mcandre
+1  A: 

If you are using gawk:

awk 'BEGIN{
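    s="03/05/2009"                          # MM/DD/YYYY, as in the question
    m=split(s,date,"/")                     # date[1]=month, date[2]=day, date[3]=year
    t=date[3]" "date[1]" "date[2]" 0 0 0"   # mktime() expects "YYYY MM DD HH MM SS"
    print strftime("%b %d",mktime(t))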
}'

The above is just an example. You did not show your actual code, so I cannot incorporate it into yours.

ghostdog74
See my other comment on Dennis's solution, but strftime("%b %e",mktime(t)) is actually closer to what I wanted.
dtjohnso
+1  A: 

Why don't you prepend your awk-date to the original date? This yields a sortable key, but is human readable.

(Note: for it to sort correctly, the key should be in yyyymmdd order.)

If needed, cut can remove the prepended column.
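Something along these lines might work as a sketch of that idea. The function name count_by_day and the tab-separated layout are my own assumptions, not anything from the question. It reads the original date lines on stdin, builds a yyyymmdd key next to the readable date, sorts on the key, then drops the key with cut. It only uses plain awk features, so gawk should not be required.

```shell
# Hypothetical helper: count date-style lines ("Thu Mar 5 16:46:15 EST 2009")
# per day and print "Mon DD, YYYY<TAB>count" in chronological order.
count_by_day() {
    awk '
        {
            # Month name -> number from its position in a lookup string
            m = (index("JanFebMarAprMayJunJulAugSepOctNovDec", $2) + 2) / 3
            key = sprintf("%04d%02d%02d", $6, m, $3)        # sortable yyyymmdd
            human[key] = sprintf("%s %2d, %d", $2, $3, $6)  # readable form
            total[key]++
        }
        END {
            # Emit key first so sort orders by date, not by month name
            for (key in total)
                printf "%s\t%s\t%d\n", key, human[key], total[key]
        }
    ' | LC_ALL=C sort | cut -f2-
}
```

Called as count_by_day < original.txt, it leaves the header line for you to print separately. Input lines that do not look like dates are not filtered out here.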

xtofl
+2  A: 

I get testy when I see someone using grep and awk (and sed, cut, ...) in a pipeline. Awk can fully handle the work of many utilities.

Here's a way to clean up your updated code to run in a single instance of awk (well, gawk), and using sort as a co-process:

gawk '
    BEGIN {
        IGNORECASE = 1
    }
    function mon2num(mon) {
        return(((index("JanFebMarAprMayJunJulAugSepOctNovDec", mon)-1)/3)+1)
    }
    / E[DS]T [[:digit:]][[:digit:]][[:digit:]][[:digit:]]/ {
        month=$2
        day=$3
        year=$6
        date=sprintf("%4d%02d%02d", year, mon2num(month), day)
        total[date]++
        human[date] = sprintf("%3s %2d, %4d", month, day, year)
    }
    END {
        sort_coprocess = "sort"
        for (date in total) {
            print date |& sort_coprocess
        }
        close(sort_coprocess, "to")
        print "Date\tCount"
        while ((sort_coprocess |& getline date) > 0) {
            print human[date] "\t" total[date]
        }
        close(sort_coprocess)
    }
' original.txt
glenn jackman
Thanks! I wondered if there was a way to do this all in [g]awk, but I'm obviously not good enough with it. I like your year-indifferent match pattern too.
dtjohnso
Another way is to use gawk's own asort() or asorti() routines.
ghostdog74
+2  A: 

Use awk's sort and date's stdin to greatly simplify the script

The date command accepts input from stdin (-f -), so you can eliminate one pipe to awk and the temporary file. You can also eliminate the pipe to sort by using awk's array sort, which in turn eliminates another pipe to awk. There's also no need for a coprocess.

This script uses date for the monthname conversion which would presumably continue to work in other languages (ignoring the timezone and month/day order issues, though).

The end result looks like "grep|date|awk". I have broken it into separate lines for readability (it would be about half as big if the comments were eliminated):

grep -i "E[DS]T 2009" original.txt | 
date -f - +'%Y %m %d' | #reformat dates as YYYYMMDD for future sort
awk ' 
BEGIN { printf "%s\t%s\r\n","Date","Count" }

{ ++total[$0] } # pump dates into associative array

END {
    idx=1
    for (item in total) {
        d[idx]=item;idx++ # copy the array indices into the contents of a new array
    }
    c=asort(d) # sort the contents of the copy
    for (i=1;i<=c;i++) { # use the contents of the copy to index into the original
        printf "%s\t%2.d\r\n",strftime("%b %e, %Y",mktime(d[i]" 0 0 0")),total[d[i]]
    }
}'
Dennis Williamson
Very nice! Did you intend to also include this?: BEGIN { printf "%s\t%s\r\n","Date","Count" }
dtjohnso
Oops, forgot the header. Fixed.
Dennis Williamson
One other thing, what I really wanted for the output was what I get with strftime("%b %e, %Y"... not strftime("%b %d, %Y"... An easy enough fix though.
dtjohnso
Fixed.
Dennis Williamson