I have a comma-delimited file, "myfile.csv", where the 5th column is a date/time stamp (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots).

I'm using a bash shell via Cygwin on WinXP.

$ cut -d, -f 5 myfile.csv | sort | uniq -d

correctly returns a list of the duplicate dates

01/01/2005 00:22
01/01/2005 00:37
[snip]    
02/29/2009 23:54

But I cannot figure out how to feed this to grep to give me all the rows. Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d but for some reason, combining those flags causes uniq to (apparently) return nothing.

So, given that

 $ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv

doesn't work... what can I do?

I know that I could do this in Perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard command-line tools like sort, uniq, find, grep, cut, etc.

Teach me, oh bash gurus. How can I get the list of rows I need using typical CLI tools?

+1  A: 

The -z option of uniq needs the input to be NUL-separated. You can filter the output of cut through:

tr '\n' '\000'

to get NUL-separated rows. Then sort, uniq and xargs have options to handle that. Try something like:

cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv

Edit: the position of tr in the pipe was wrong.
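
For display, the NUL-separated duplicates can also be turned back into lines at the end (just an illustrative sketch along the same route, assuming GNU tools):

cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | tr '\000' '\n'

This prints the duplicate timestamps themselves, one per line, without going through grep.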

kmkaplan
A: 

My version of uniq(1) doesn't work with -z either. Looking further, it's because the input needs to be NUL-terminated.

So just avoid the null terminator, and instead be safe about your grep(1) expressions by using the -F option.

cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -I '{}' grep -F '{}' myfile.csv

It worked for me. Xargs already processes per-line with -I, so don't worry about the spaces if you're using {}.
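
A quick way to see that behaviour (an illustrative sketch, assuming GNU xargs):

printf '01/01/2005 00:22\n01/01/2005 00:37\n' | xargs -I '{}' echo "pattern: [{}]"

Each input line arrives as a single argument, spaces included, so the timestamps are never split apart.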

ashawley
+6  A: 
  1. sort -k5,5 will do the sort on fields and avoid the cut;
  2. uniq -f 4 will ignore the first 4 fields for the uniq;
  3. Plus a -D on the uniq will get you all of the repeated lines (vs -d, which gets you just one per group);
  4. but sort and uniq will expect blank-delimited fields instead of CSV, so tr ',' '\t' to fix that.

The problem is if you have fields after the 5th that differ. Are your dates all the same length? You might be able to add a -w 16 (to include the time) or a -w 10 (for just the date) to the uniq.

So:

tr ',' '\t' < myfile.csv | sort -k5,5 | uniq -f 4 -D -w 16
Andrew Barnett
Yes, +1. And tr '\t' ',' at the end if the CSV format is important.
kmkaplan
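
Putting the answer and the comment together, a sketch of the full round trip (assuming GNU coreutils and no literal tabs inside any field):

tr ',' '\t' < myfile.csv | sort -k5,5 | uniq -f 4 -D -w 16 | tr '\t' ','

If uniq counts the tab in front of the 5th field toward -w, then -w 17 rather than 16 would be needed to cover the whole 16-character timestamp.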
A: 

You can tell xargs to use each line as an argument in its entirety using the -d option. Try:

cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
Glomek
You are missing quotes around a {}
kmkaplan
A: 

Try escaping the spaces with sed:

echo 01/01/2005 00:37 | sed 's/ /\\ /g'

cut -d, -f 5 myfile.csv | sort | uniq -d | sed 's/ /\\ /g' | xargs -I '{}' grep '{}' myfile.csv

(Yet another way would be to read the duplicate date lines into an IFS=$'\n' array and iterate over it in a for loop; a sketch of that is below.)

Note: two backslashes are needed in the sed expression. Inside single quotes sed receives \\, which stands for one literal backslash, so each space comes out escaped as '\ '.
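
A sketch of the array approach mentioned above (assuming bash, and that the timestamp values contain no glob characters):

# Collect the duplicate timestamps into an array, splitting only on newlines.
old_ifs=$IFS
IFS=$'\n'
dups=( $(cut -d, -f 5 myfile.csv | sort | uniq -d) )
IFS=$old_ifs

# Print every row that contains one of the duplicate timestamps.
for d in "${dups[@]}"; do
    grep -F -- "$d" myfile.csv
done

Using grep -F means the slashes and spaces in the timestamps need no escaping at all.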
A: 

This is a good candidate for awk:

BEGIN { FS="," }
{ split($5,A," "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }
  1. Set the field separator to ',' (CSV).
  2. Split the fifth field on the space and put the result in A (so A[1] holds the date).
  3. Concatenate the line number to the list of what we have already stored for that date.
  4. Print out the line numbers for each date.
Porges
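
Building on that awk program, a hedged variant (the count array and the dupdates.awk file name are illustrative additions, not part of the answer) that prints only the dates occurring on more than one row:

# Usage, assuming the script is saved as dupdates.awk (a hypothetical name):
#   awk -f dupdates.awk myfile.csv
BEGIN { FS="," }
{
    split($5, A, " ")               # A[1] = date part, A[2] = time part
    count[A[1]]++                   # how many rows carry this date
    date[A[1]] = date[A[1]] " " NR  # accumulate matching line numbers
}
END {
    for (i in date)
        if (count[i] > 1)           # report only the dates that repeat
            print i ":" date[i]
}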