I have a comma-delimited file, "myfile.csv", where the 5th column is a date/time stamp (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots).

I'm using a bash shell via Cygwin on WinXP.

$ cut -d, -f 5 myfile.csv | sort | uniq -d

correctly returns a list of the duplicate dates

01/01/2005 00:22
01/01/2005 00:37
[snip]    
02/29/2009 23:54

But I cannot figure out how to feed this to grep to give me all the rows. Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d but for some reason, combining those flags causes uniq to (apparently) return nothing.

So, given that

 $ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv

doesn't work... what can I do?

I know that I could do this in Perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard command-line tools like sort, uniq, find, grep, cut, etc.

Teach me, oh bash gurus. How can I get the list of rows I need using typical CLI tools?

+1  A: 

The -z option of uniq needs the input to be NUL-separated. You can filter the output of cut through:

tr '\n' '\000'

to get NUL-separated rows. Then sort, uniq and xargs have options to handle that. Try something like:

cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv

Edit: the position of tr in the pipe was wrong.
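
For display, the NUL-separated duplicates can also be turned back into lines at the end (just an illustrative sketch along the same route, assuming GNU tools):

cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | tr '\000' '\n'

This prints the duplicate timestamps themselves, one per line, without going through grep.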

kmkaplan
A: 

My version of uniq(1) doesn't work with -z either. Looking further, it's because the input needs to be NUL-terminated.

So just avoid the null terminator, and instead be safe about your grep(1) expressions by using the -F option.

cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -I '{}' grep -F '{}' myfile.csv

It worked for me. Xargs already processes per-line with -I, so don't worry about the spaces if you're using {}.
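
A quick way to see that behaviour (an illustrative sketch, assuming GNU xargs):

printf '01/01/2005 00:22\n01/01/2005 00:37\n' | xargs -I '{}' echo "pattern: [{}]"

Each input line arrives as a single argument, spaces included, so the timestamps are never split apart.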

ashawley
+6  A: 
  1. sort -k5,5 will do the sort on fields and avoid the cut;
  2. uniq -f 4 will ignore the first 4 fields for the uniq;
  3. Plus a -D on the uniq will get you all of the repeated lines (vs -d, which gets you just one per group);
  4. but sort and uniq will expect blank-delimited fields instead of CSV, so tr ',' '\t' to fix that.

The problem is if you have fields after the 5th that differ. Are your dates all the same length? You might be able to add a -w 16 (to include the time) or a -w 10 (for just the date) to the uniq.

So:

tr ',' '\t' < myfile.csv | sort -k5,5 | uniq -f 4 -D -w 16
Andrew Barnett
Yes, +1. And tr '\t' ',' at the end if the CSV format is important.
kmkaplan
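
Putting the answer and the comment together, a sketch of the full round trip (assuming GNU coreutils and no literal tabs inside any field):

tr ',' '\t' < myfile.csv | sort -k5,5 | uniq -f 4 -D -w 16 | tr '\t' ','

If uniq counts the tab in front of the 5th field toward -w, then -w 17 rather than 16 would be needed to cover the whole 16-character timestamp.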
A: 

You can tell xargs to use each line as an argument in its entirety using the -d option. Try:

cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
Glomek
You are missing quotes around a {}
kmkaplan
A: 

Try escaping the spaces with sed:

echo 01/01/2005 00:37 | sed 's/ /\\ /g'

cut -d, -f 5 myfile.csv | sort | uniq -d | sed 's/ /\\ /g' | xargs -I '{}' grep '{}' myfile.csv

(Yet another way would be to read the duplicate date lines into an IFS=$'\n' array and iterate over it in a for loop; a sketch of that is below.)

Note: two backslashes are needed in the sed expression. Inside single quotes sed receives \\, which stands for one literal backslash, so each space comes out escaped as '\ '.
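
A sketch of the array approach mentioned above (assuming bash, and that the timestamp values contain no glob characters):

# Collect the duplicate timestamps into an array, splitting only on newlines.
old_ifs=$IFS
IFS=$'\n'
dups=( $(cut -d, -f 5 myfile.csv | sort | uniq -d) )
IFS=$old_ifs

# Print every row that contains one of the duplicate timestamps.
for d in "${dups[@]}"; do
    grep -F -- "$d" myfile.csv
done

Using grep -F means the slashes and spaces in the timestamps need no escaping at all.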
A: 

This is a good candidate for awk:

BEGIN { FS="," }
{ split($5,A," "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }
  1. Set the field separator to ',' (CSV).
  2. Split the fifth field on the space and put the result in A (so A[1] holds the date).
  3. Concatenate the line number to the list of what we have already stored for that date.
  4. Print out the line numbers for each date.
Porges
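
Building on that awk program, a hedged variant (the count array and the dupdates.awk file name are illustrative additions, not part of the answer) that prints only the dates occurring on more than one row:

# Usage, assuming the script is saved as dupdates.awk (a hypothetical name):
#   awk -f dupdates.awk myfile.csv
BEGIN { FS="," }
{
    split($5, A, " ")               # A[1] = date part, A[2] = time part
    count[A[1]]++                   # how many rows carry this date
    date[A[1]] = date[A[1]] " " NR  # accumulate matching line numbers
}
END {
    for (i in date)
        if (count[i] > 1)           # report only the dates that repeat
            print i ":" date[i]
}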