I've got a pretty big comma-delimited CSV log file (>50000 rows, let's call it file1.csv) that looks something like this:

field1,field2,MM-DD-YY HH:MM:SS,field4,field5...
...
field1,field2,07-29-10 08:04:22.7,field4,field5...
field1,field2,07-29-10 08:04:24.7,field4,field5...
field1,field2,07-29-10 08:04:26.7,field4,field5...
field1,field2,07-29-10 08:04:28.7,field4,field5...
field1,field2,07-29-10 08:04:30.7,field4,field5...
...

As you can see, there is a field in the middle that is a time stamp.

I also have a file (let's call it file2.csv) that has a short list of times:

timestamp,YYYY,MM,DD,HH,MM,SS
20100729180031,2010,07,29,18,00,31
20100729180039,2010,07,29,18,00,39
20100729180048,2010,07,29,18,00,48
20100729180056,2010,07,29,18,00,56
20100729180106,2010,07,29,18,01,06
20100729180115,2010,07,29,18,01,15

What I would like to do is to extract only the lines in file1.csv that have times specified in file2.csv.

How do I do this with a bash script? Since file1.csv is quite large, efficiency would also be a concern. I've done very simple bash scripts before, but really don't know how to deal with this. Perhaps some implementation of awk??? Or is there another way?

P.S. Complication 1: I manually spot checked some of the entries in both files to make sure they would match, and they do. There just needs to be a way to remove (or ignore) the extra ".7" at the end of the seconds ("SS") field in file1.csv.

P.P.S. Complication 2: Turns out the entries in file1.csv are all separated by about two seconds. Sometimes the time stamps in file2.csv fall right in between two of the entries in file1.csv! Is there a way to find the closest match in this case?

+1  A: 

One approach is to use awk to convert the timestamps in file2.csv to file1.csv's format, then use grep -f to search through file1.csv. This should be quite fast as it will only make one pass through file1.csv.

awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv > times.list
grep -f times.list file1.csv

You could combine this all into one line if you wish:

grep -f <(awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv) file1.csv
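
A small variation on the above (my suggestion, not part of the original answer): since the generated times are literal strings rather than patterns, `grep -F` skips regex interpretation and is typically faster on large files, and `printf` with `%02d` also zero-pads the two-digit year for dates before 2010:

```shell
# Same idea as above, but emit fixed strings and match them with grep -F.
# %02d zero-pads each component (e.g. year 2005 becomes "05", not "5").
awk -F, '!/timestamp/ {printf "%02d-%02d-%02d %02d:%02d:%02d\n", $3, $4, $2-2000, $5, $6, $7}' file2.csv > times.list
grep -F -f times.list file1.csv
```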
John Kugelman
Good news: YES, it worked!!! Bad news: I discovered that the log file (file1.csv) has entries for every other second. Sometimes the times in file2.csv happen to fall between two file1.csv entries! I would be OK with finding the closest second (doesn't matter whether before or after) in file1.csv, but is this even possible?? Thanks for your help so far!
penyuan
+2  A: 

Building on John's answer, you could sort and join the files, printing just the columns you want (or all of them, if need be). Please take a look below. Note that I'm assuming you're on a UNIX such as Solaris, where nawk can be faster than awk, and where gawk, which would make this even easier, may not be available:

# John's nice code
awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv > times.list
# Sorting times.list file to prepare for the join
sort times.list -o times.list
# Sorting file1.csv
sort -t, -k3,3 file1.csv -o file1.csv
# Finally joining files and printing the rows that match the times
join -t, -1 3 -2 1 -o 1.1 1.2 1.3 1.4 1.5......1.50 file1.csv times.list 

One nice property of this method is that it can be adapted to many different cases, such as a different column order, or key columns that are not concatenated. That would be very hard to do with grep (with or without regexes).
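
To make the moving parts concrete, here is a toy run of the same sort-and-join pipeline (the file names and data below are made up for illustration; `LC_ALL=C` keeps sort and join using the same byte-wise collation):

```shell
# Toy data: the 3rd comma-separated column of left.csv is the key;
# keys.txt holds one key per line.
printf 'a,b,07-29-10 18:00:31,d\na,b,07-29-10 18:00:40,d\n' > left.csv
printf '07-29-10 18:00:31\n' > keys.txt

# Both inputs must be sorted on the join field, with the same collation.
LC_ALL=C sort -t, -k3,3 left.csv -o left.csv
LC_ALL=C sort keys.txt -o keys.txt

# Default join output: the key first, then the remaining fields of left.csv.
# Prints: 07-29-10 18:00:31,a,b,d
LC_ALL=C join -t, -1 3 -2 1 left.csv keys.txt
```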

jyzuz
Cool idea! Unfortunately it seems that there is still the problem of some time stamps from file2.csv falling between time stamps in file1.csv, as I indicated in the comment above. This probably complicates matching a lot (I would be OK with finding the closest match in terms of time, not an exact one), but if you have any ideas that would be awesome! Thanks again.
penyuan
+1  A: 

If you have GNU awk (gawk), you can use this technique.

To match the nearest times, one approach is to have awk print two lines for each line in file2.csv, the second with one second added, then use the result with grep -f as in John Kugelman's answer.

awk -F, 'NR>1 {$1=""; print strftime("%m-%d-%y %H:%M:%S", mktime($0));
               print strftime("%m-%d-%y %H:%M:%S", mktime($0) + 1)}' file2.csv > times.list
grep -f times.list file1.csv

This illustrates a few different techniques.

  • NR>1 skips record number one, the header (matching on the header's content, as in John's !/timestamp/, is actually more robust)
  • instead of dealing with each field individually, $1 is emptied and strftime produces the output in the desired format
  • mktime converts a string in the form "YYYY MM DD HH MM SS" (the -F, plus the assignment to $1 strips the commas) to seconds since the epoch, and 1 is added to it for the second line
Dennis Williamson
Aha, it worked wonderfully. I wish I could mark all the answers so far as the accepted answer, as each one was instrumental in the final solution... :) Thank you all for your help!!
penyuan