Hello. I have a tab-delimited file with 5 columns and need to count the number of unique values in column 2. I would normally do this with Perl/Python, but I am forced to use the shell for this one.

In the past I have successfully used the *nix uniq command piped to wc, but it looks like I am going to have to use awk here.

Any advice would be greatly appreciated. (I previously asked a similar question about column checks using awk, but this one is a little different, and I wanted to keep it separate so that anyone with the same question in the future can find it here.)

Many many thanks!
Lilly

+3  A: 

No need to use awk.

$ cut -f2 file.txt | sort | uniq | wc -l

should do it.

This uses the fact that tab is cut's default field separator, so -f2 gives us just the content of column two. The pass through sort then puts identical values next to each other so that uniq, which only collapses adjacent duplicates, can remove them. Finally wc -l counts the remaining lines, which is the number sought.
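
As a rough illustration of the pipeline above, here is a minimal sketch run against a small made-up file. The sample rows and the variable names (sample, count_pipeline, count_awk) are assumptions for the demo, not part of the question. The awk line is an alternative one-pass variant, included only because the question mentions awk; the tr call strips the padding that some wc implementations put around the number.

```shell
# Build a small tab-delimited sample file (hypothetical data, 5 columns).
sample=$(mktemp)
printf 'r1\tapple\tx\ty\tz\n'  >  "$sample"
printf 'r2\tbanana\tx\ty\tz\n' >> "$sample"
printf 'r3\tapple\tx\ty\tz\n'  >> "$sample"
printf 'r4\tcherry\tx\ty\tz\n' >> "$sample"

# Count distinct values in column 2 with the cut/sort/uniq pipeline.
count_pipeline=$(cut -f2 "$sample" | sort | uniq | wc -l | tr -d '[:space:]')

# Same count in a single awk pass, splitting on tabs only.
# seen[$2]++ is 0 (false) the first time a value appears, so n counts first sightings.
count_awk=$(awk -F'\t' '!seen[$2]++ { n++ } END { print n + 0 }' "$sample")

echo "pipeline: $count_pipeline, awk: $count_awk"
rm -f "$sample"
```

With this sample data both approaches should report 3 (apple, banana, cherry).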

unwind
This is great. After messing around I discovered that I can find any dupes with this: cat file.txt | awk '{print $2}' | sort | uniq -c | sort -n
Lilly Tooner
+2  A: 

I go for

$ cut -f2 file.txt | sort -u | wc -l

uniq relies on the input data being sorted, because it compares only adjacent lines. sort -u sorts and removes duplicates in one step.

For example in the Sun docs:

The uniq utility will read an input file comparing adjacent lines, and write one copy of each input line on the output. The second and succeeding copies of repeated adjacent input lines will not be written.

Repeated lines in the input will not be detected if they are not adjacent.
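
The adjacency behaviour quoted above can be seen with a tiny made-up input. This is only a sketch; the three-line input and the variable names are assumptions chosen for the demonstration.

```shell
# Hypothetical unsorted input: "a" appears twice, but not on adjacent lines.
input=$(printf 'a\nb\na\n')

# uniq alone only collapses adjacent repeats, so all three lines survive.
unsorted_count=$(printf '%s\n' "$input" | uniq | wc -l | tr -d '[:space:]')

# sort -u sorts first, which makes the duplicates adjacent, then drops them.
sorted_count=$(printf '%s\n' "$input" | sort -u | wc -l | tr -d '[:space:]')

echo "uniq only: $unsorted_count, sort -u: $sorted_count"
```

Here uniq on its own leaves 3 lines, while sort -u leaves 2, which is why the sort step cannot be skipped.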

martin clayton
A: 

Hi, I have a similar problem. Can anyone please help me with a shell or Perl script? I have a flat file like this:

fruit     country
apple     germany
apple     india
banana    pakistan
banana    saudi
mango     india

I want to get output like this:

apple     germany
banana    pakistan
mango     india

Is there any way this can be done?

sud
@sud - Suggest you post your answer as a question in its own right, perhaps with a little more explanation. Are you looking for the first country for each fruit in the input?
martin clayton