views: 102
answers: 5
I have a text file with a large amount of tab-delimited data. I want to have a look at the data so that I can see the unique values in a column. For example,

Red     Ball 1 Sold
Blue    Bat  5 OnSale
............... 

So, it's like the first column has colors: I want to know how many different unique values there are in that column, and I want to be able to do that for each column.

I need to do this on the Linux command line, so probably using a bash script, sed, awk or something.

Addendum: Thanks everyone for the help. Can I ask one more thing? What if I wanted a count of these unique values as well?

I guess I didn't put the second part clearly enough. What I wanted was a count of "each" of these unique values, not to know how many unique values there are. For instance, in the first column I want to know how many Red, Blue, Green etc. coloured objects there are.

+7  A: 

You can make use of the cut, sort and uniq commands as follows:

cat input_file | cut -f 1 | sort | uniq

This gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.

Avoiding UUOC :)

cut -f 1 input_file | sort | uniq

EDIT:

To count the number of unique values, you can add the wc command to the chain:

cut -f 1 input_file | sort | uniq | wc -l
codaddict
Useless use of `cat` award for the day :-)
Douglas Leeder
@Douglas: Award accepted :)
codaddict
you can also use `sort -u` instead of `sort | uniq`
Hasturkun
`uniq -c` will give the counts per item - `wc -l` will count the total number of items.
Dennis Williamson
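
Combining those two comments, a minimal sketch that prints a per-value count for the first column (assuming the same tab-delimited input_file):

cut -f 1 input_file | sort | uniq -c
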
A: 

AWK is your friend. You can write simple one-off programs for this kind of thing on the command line. Read its manual (man gawk) and you'll be enlightened.
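
For example, a one-off program that prints each distinct value in the first column together with its count (a sketch, assuming a tab-delimited file called test.txt):

awk -F'\t' '{ count[$1]++ } END { for (v in count) print count[v], v }' test.txt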

Ian
+3  A: 

You can use awk, sort & uniq to do this. For example, to list all the unique values in the first column:

awk < test.txt '{print $1}' | sort | uniq

As posted elsewhere, if you want to count how many unique values there are, you can pipe the unique list into wc -l:
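
For instance (a sketch, reusing the same test.txt):

awk < test.txt '{print $1}' | sort | uniq | wc -l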

Jon Freedman
A: 

Assuming the data file is actually tab-separated, not space-aligned:

<test.tsv awk '{print $4}' | sort | uniq

Where, for the first sample line, the fields are:

  • $1 - Red
  • $2 - Ball
  • $3 - 1
  • $4 - Sold
Douglas Leeder
+1  A: 
# COLUMN is integer column number
# INPUT_FILE is input file name

cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
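
To run it for every column, for example (a sketch, assuming the sample data has 4 tab-delimited columns in test.txt):

for COLUMN in 1 2 3 4; do
    printf 'column %s: ' "${COLUMN}"
    cut -f ${COLUMN} < test.txt | sort -u | wc -l
done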
stacker