I have a ksh script that returns a long list of values, newline separated, and I want to see only the unique/distinct values. Is it possible to do this?

For example, say my output is file suffixes in a directory:

tar
gz
java
gz
java
tar
class
class

I want to see a list like:

tar
gz
java
class
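
For reference, a list like this could be produced with something along these lines (a hypothetical sketch, not my actual script; it assumes every file of interest has a dot in its name):

ls | sed -n 's/.*\.//p'

The -n flag together with the p command makes sed print only the lines where the substitution matched, i.e. only names that contain a dot.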
+13  A: 

You might want to look at the uniq and sort applications.

./yourscript.ksh | sort | uniq

(FYI: yes, the sort is necessary in this pipeline; uniq only strips duplicate lines that are immediately adjacent to each other.)
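
To see why, here is a minimal sketch you can paste into a shell (the sample values are made up):

printf 'tar\ngz\ntar\n' | uniq           # prints all three lines; the two "tar"s are not adjacent
printf 'tar\ngz\ntar\n' | sort | uniq    # prints "gz" and "tar" once each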

EDIT:

Contrary to what has been posted by Aaron Digulla in relation to uniq's command-line options:

Given the following input:

class
jar
jar
jar
bin
bin
java

uniq will output each line exactly once:

class
jar
bin
java

uniq -d will output all lines that appear more than once, and it will print them once:

jar
bin

uniq -u will output all lines that appear exactly once, and it will print them once:

class
java
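
If you want to verify this yourself, a quick way (the file name here is arbitrary):

printf 'class\njar\njar\njar\nbin\nbin\njava\n' > input.txt
uniq input.txt      # class jar bin java
uniq -u input.txt   # class java
uniq -d input.txt   # jar bin

Note that this input is already grouped, so no sort is needed in this particular case.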
Matthew Scharley
Does the job, thanks!
Brabster
A: 

Pipe them through sort and uniq. This removes all duplicates.

"uniq -d" gives only the duplicates; "uniq -u" gives only the unique ones (it strips lines that appear more than once).

Aaron Digulla
gotta sort first by the looks of it
Brabster
Yes, you do. Or more accurately, you need to group all the duplicate lines together. Sorting does this by definition though ;)
Matthew Scharley
Also, `uniq -u` is NOT the default behaviour (see the edit in my answer for details)
Matthew Scharley
A: 

For larger data sets where sorting may not be desirable, you can also use the following Perl one-liner:

./yourscript.ksh | perl -ne 'if (!defined $x{$_}) { print $_; $x{$_} = 1; }'

This simply remembers every line it has already printed, so it never outputs a line a second time.

It has the advantage over the "sort | uniq" solution in that there's no sorting required up front.
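
A nice side effect (quick sketch, with made-up input) is that, unlike sort | uniq, it preserves the order in which values first appear:

printf 'tar\ngz\njava\ngz\ntar\n' | perl -ne 'if (!defined $x{$_}) { print $_; $x{$_} = 1; }'
# tar
# gz
# java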

paxdiablo
Note that sorting a very large file is not an issue per se for sort; it can sort files that are larger than the available RAM+swap. Perl, OTOH, will fail if there are only a few duplicates (the hash of seen lines has to fit in memory).
Aaron Digulla
Yes, it's a trade-off depending on the expected data. Perl is better for a huge data set with many duplicates (no disk-based storage required). A huge data set with few duplicates should use sort (and disk storage). Small data sets can use either. Personally, I'd try Perl first and switch to sort if it fails.
paxdiablo
That said, sort only gives you a benefit if it has to swap to disk.
paxdiablo
+2  A: 

With zsh you can do this (the (u) parameter-expansion flag expands an array to its unique elements):

zsh-4.3.9[t]% cat file
tar
gz
java
gz
java
tar
class
class
zsh-4.3.9[t]% u=($(<file)) 
zsh-4.3.9[t]% print -l ${(u)u[@]}
tar
gz
java
class

Or you can use AWK; the expression !_[$0]++ is true only the first time a given line is seen (the associative array _ counts occurrences), so each line is printed exactly once:

zsh-4.3.9[t]% awk '!_[$0]++' file    
tar
gz
java
class
radoulov
+6  A: 
./script.sh | sort -u

This is the same as monoxide's answer, but a little less verbose: sort -u sorts and removes duplicates in one step.

gpojd