ansaurus

Question

Bash script to extract entries form log file based on dates specified in another file?

Answer 1

+1 A:

One approach is to use awk to convert the timestamps in file2.csv to file1.csv's format, then use grep -f to search through file1.csv. This should be quite fast as it will only make one pass through file1.csv.

awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv > times.list
grep -f times.list file1.csv

You could combine this all into one line if you wish:

grep -f <(awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv) file1.csv

John Kugelman 2010-09-02 21:42:51

Good news: YES it worked!!!Bad news: I discovered that the log file (list1.csv) has entries for every other second. Sometimes the times in list2.csv happens to fall between two list1.csv entries! I would be OK with finding the closest second (doesn't matter before or after) in list1.csv, but is this even possible??Thanks for your help so far!

penyuan 2010-09-02 22:42:43

Answer 2

+2 A:

Taking advantage of John's answer, you could sort and join the files, printing just the columns you want (or all columns if the case). Please take a look below (note that I'm considering that you're using UNIX, like Solaris, so nawk could be faster than awk, also we don't have gawk that could facilitate even more):

# John's nice code
awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv > times.list
# Sorting times.list file to prepare for the join
sort times.list -o times.list
# Sorting file1.csv
sort -t, -k3,3 file1.csv -o file1.csv
# Finally joining files and printing the rows that match the times
join -t, -1 3 -2 1 -o 1.1 1.2 1.3 1.4 1.5......1.50 file1.csv times.list

One special particularity from this method is that you could change it in order to work in several different cases, like with different columns order, and also in cases when the key columns are not concatenated. It would be very hard to do this with grep (using regexp or not)

jyzuz 2010-09-02 22:25:02

Cool idea! Unfortunately it seems that there is still the problem of some time stamps from list2.csv falling between time stamps in list1.csv as I indicated in the comment above. This problem probably complicates matching a lot (I would be OK with finding the closest match in terms of time, not exact), but if you have any ideas that would be awesome! Thanks again.

penyuan 2010-09-02 23:07:17

Answer 3

+1 A:

If you have GNU awk (gawk), you can use this technique.

In order to match the nearest times, one approach would be to have awk print two lines for each line in file2.csv, then use that with grep -f as in John Kugelman's answer. The second line will have one second added to it.

awk -F, 'NR>1 {$1=""; print strftime("%m-%d-%y %H:%M:%S", mktime($0));
                               print strftime("%m-%d-%y %H:%M:%S", mktime($0) + 1)}' file2.csv > times.list
grep -f times.list file1.csv

This illustrates a couple of different techniques.

skip record number one to skip the header (using a match is actually better)
instead of dealing with each field individually, $1 is emptied and strftime creates the output in the desired format
mktime converts the string in the format "yyyy mm dd hh mm ss" (the -F, and the assignment to $1 removes the commas) to a number of seconds since the epoch, and we add 1 to it for the second line

Dennis Williamson 2010-09-03 03:22:35

Aha, it worked wonderfully. I wish I could mark all the answers so far as the accepted answer, as each one was instrumental in the final solution... :) Thank you all for your help!!

penyuan 2010-09-03 14:02:22

ansaurus

tags:

views:

answers:

Bash script to extract entries form log file based on dates specified in another file?

related questions