



Hello, I have a .csv file like this:

[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 00:58:29.793000000,,
[email protected],2009-11-27 00:58:29.646465785,,

I have to remove lines with duplicate e-mail addresses (the entire line) from the file. The problem is: how do I use 'uniq' on field 1 only (fields are separated by commas)? According to the man page, uniq doesn't have options for selecting a column.

I tried something with sort | uniq, but it doesn't work. Thank you.


Well, simpler than isolating the column with awk: if you need to remove every line with a certain value from a given file, why not just use grep -v?

E.g. to delete every line with the value "col2" in the second position, as in the line: col1,col2,col3,col4

grep -v ',col2,' file > file_minus_offending_lines

If this isn't good enough, because the matching value may also show up in a different column and cause the wrong lines to be stripped, you can do something like this:

Use awk to isolate the offending column, e.g.:

awk -F, '{print $2 "|" $0}' file

The -F sets the field delimiter to ",", $2 means column 2, followed by a custom delimiter and then the entire line ($0). You can then filter by removing lines that begin with the offending value followed by the delimiter:

awk -F, '{print $2 "|" $0}' file | grep -v '^BAD_VALUE|'

and then strip out the stuff before the delimiter:

awk -F, '{print $2 "|" $0}' file | grep -v '^BAD_VALUE|' | sed 's/^[^|]*|//'

(Note: the sed pattern is "[^|]*", i.e. anything that is not the delimiter, so only the added prefix is removed. This still assumes that column 2 never contains "|". Hopefully this is clear enough.)
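
As a possible shortcut, the same filtering can probably be done in a single awk call by comparing the field directly. This is only a sketch: drop_by_col2 is a made-up helper name, BAD_VALUE is a placeholder, and the sample rows are invented for illustration.

```shell
# Hypothetical helper: drop every row whose second comma-separated
# field is exactly equal to the value given as the first argument.
drop_by_col2() {
    # -v copies the shell value into the awk variable "bad".
    # Appending "" to $2 forces a string comparison, so a value such
    # as "1.0" is not treated as equal to "1".
    awk -F, -v bad="$1" '($2 "") != bad'
}

# Invented sample rows: only the row with BAD_VALUE in column 2 is
# removed; the same text in column 3 is left alone.
printf '%s\n' 'a,keep,x' 'b,BAD_VALUE,y' 'c,keep,BAD_VALUE' |
    drop_by_col2 BAD_VALUE
```

Because the comparison looks at field 2 only, no temporary delimiter or cleanup step is needed.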

Steve B.

By sorting the file with sort first, you can then apply uniq. Note that uniq compares whole lines, so this only removes rows that are identical in every field.

Sorting the file works just fine:

$ cat test.csv
[email protected],2009-11-27 00:58:29.793000000,,
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 00:58:29.646465785,, 
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 01:05:47.893000000,,

$ sort test.csv
[email protected],2009-11-27 00:58:29.646465785,, 
[email protected],2009-11-27 00:58:29.793000000,,
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 01:05:47.893000000,,

$ sort test.csv | uniq
[email protected],2009-11-27 00:58:29.646465785,, 
[email protected],2009-11-27 00:58:29.793000000,,
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 01:05:47.893000000,,

You could also do some AWK magic, which keeps the last line seen for each address (the output order is not guaranteed):

$ awk -F, '{ lines[$1] = $0 } END { for (l in lines) print lines[l] }' test.csv
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 01:05:47.893000000,,
[email protected],2009-11-27 00:58:29.646465785,,
Mikael S
sort -u -t, -k1,1 file
  • -u for unique
  • -t, so comma is the delimiter
  • -k1,1 for the key field 1

Test result:

[email protected],2009-11-27 00:58:29.793000000,, 
[email protected],2009-11-27 01:05:47.893000000,,
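
If it matters which of the duplicate rows survives, one option is to sort on the timestamp as a secondary key and then keep the first row per address. The sketch below assumes that the newest entry should win and that the timestamps sort correctly as plain text; newest_per_email is a hypothetical helper name and the addresses are invented.

```shell
# Hypothetical helper: read CSV rows on stdin and keep only the row
# with the latest timestamp (field 2) for each address (field 1).
newest_per_email() {
    # Sort by field 1 ascending, then by field 2 descending, so the
    # newest row comes first within each address. LC_ALL=C selects
    # plain byte-order comparison, which suits fixed-width timestamps.
    LC_ALL=C sort -t, -k1,1 -k2,2r |
        awk -F, '!seen[$1]++'   # print only the first row per address
}

# Invented sample data in the same shape as the question's file.
printf '%s\n' \
    'bob@example.com,2009-11-27 00:58:29.793000000,,' \
    'alice@example.com,2009-11-27 00:58:29.646465785,,' \
    'bob@example.com,2009-11-27 01:05:47.893000000,,' |
    newest_per_email
```

Dropping the r from -k2,2r would keep the oldest entry per address instead.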
Carl Smotricz
Nice, I didn't know you could do that with sort.
Steve B.
Wow, very nice, that's exactly what I was looking for. Thanks a lot!

Or, if you want to use uniq:

cat mycvs.cvs | tr -s ',' ' ' | awk '{print $3" "$2" "$1}' | uniq -c -f2


1 01:05:47.893000000 2009-11-27 [email protected]
2 00:58:29.793000000 2009-11-27 [email protected]
Carsten C.
I'd like to point out a possible simplification: You can dump the `cat`! Rather than piping into tr, just let tr read the file using `<`. Piping through `cat` is a common unnecessary complication used by novices. For large amounts of data there's a performance effect to be had.
Carl Smotricz
Good to know. Thanks! (Of course this makes sense, thinking of "cat" and "laziness" ;))
Carsten C.
awk -F"," '!_[$1]++' file
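
For readers unfamiliar with the idiom, the one-liner can be spelled out roughly as follows. This is a sketch of an equivalent long form, not a different method; first_per_key and the sample rows are made up for illustration. Unlike sort -u, it keeps the input order and retains the first line seen for each address.

```shell
# Hypothetical helper: long form of  awk -F, '!_[$1]++'
# Reads CSV rows on stdin and prints the first row for each field-1 key.
first_per_key() {
    awk -F, '
        {
            # seen[$1] is unset (compares equal to 0) the first time
            # a key appears, so only that first row is printed.
            if (seen[$1] == 0) print
            # Count the key so later rows with the same key are skipped.
            seen[$1]++
        }'
}

# Invented sample data: the second bob row is dropped, order is kept.
printf '%s\n' \
    'bob@example.com,2009-11-27 00:58:29.793000000,,' \
    'alice@example.com,2009-11-27 00:58:29.646465785,,' \
    'bob@example.com,2009-11-27 01:05:47.893000000,,' |
    first_per_key
```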