ansaurus

Question

How can I count unique terms in a plaintext file case-insensitively?

Answer 1

+4 A:

In Python 2.4 (possibly it works on earlier systems as well):

#! /usr/bin/python2.4
import sys
h = set()
for line in sys.stdin.xreadlines():
  for term in line.split():
    h.add(term)
print len(h)

In Perl:

$ perl -ne 'for (split(" ", $_)) { $H{$_} = 1 } END { print scalar(keys%H), "\n" }' <file.txt

pts 2009-05-27 07:19:54

line.to_lower().split()? :)

Skurmedel 2009-05-27 07:22:40

For the case insensitivity - you need h.add(term.lower())

viksit 2009-05-27 07:25:54

But is that case-insensitive? If I add a "print h" line at the end, for a sample file, I get 4 and set(['bar', 'Foo', 'Bar', 'foo']). Foo and foo should be the same.

Alex 2009-05-27 07:27:17

Ah, I'm too slow guys, let me check your comments.

Alex 2009-05-27 07:27:40

Seems to work with that correction, thanks!

Alex 2009-05-27 07:29:09

Cool, I never even knew about set

Kinlan 2009-05-27 07:57:56

If you like oneliners, then the following is equivalent:

import sys
print len(set(term.lower() for line in sys.stdin for term in line.split()))

Ants Aasma 2009-05-27 08:29:51

The perl version needs $H{lc($_)} for case insensitive as well.

mikegrb 2009-05-27 15:05:05

Answer 2

+6 A:

Using bash/UNIX commands:

sed -e 's/[[:space:]]\+/\n/g' "$FILE" | sort -fu | wc -l

Eduard - Gabriel Munteanu 2009-05-27 07:34:19

Answer 3

+4 A:

Using just standard Unix utilities:

< somefile tr 'A-Z[:blank:][:punct:]' 'a-z\n' | sort | uniq -c

If you're on a system without GNU tr, you'll need to replace "[:blank:][:punct:]" with a list of all the whitespace and punctuation characters you'd like to treat as word separators rather than as part of a word, e.g., " \t.,;".

If you want the output sorted in descending order of frequency, you can append "| sort -r -n" to the end of this.

Note that this will also produce an irrelevant count for the empty lines left behind by runs of separators; if you're concerned about this, use sed after the tr to filter out the empty lines.
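
For comparison, here is a minimal Python 3 sketch of the same pipeline: lowercase, split on whitespace and punctuation, count each word, and list the counts in descending order. The name word_frequencies and the use of string.punctuation as the separator set are assumptions made for illustration, not part of the command above. Empty tokens never appear, which corresponds to filtering out the empty lines.

```python
import string
from collections import Counter

# Assumption: ASCII punctuation acts as a word separator, roughly
# matching [:punct:] in the tr command; whitespace is handled by split().
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Replace punctuation with spaces, then split on runs of whitespace.
    # str.split() with no argument never yields empty strings.
    return Counter(text.lower().translate(_PUNCT_TO_SPACE).split())

counts = word_frequencies("Foo foo, bar. Bar! baz")
# most_common() orders by descending count, like "| sort -r -n".
for word, n in counts.most_common():
    print(n, word)
```

len(counts) gives the number of distinct words, which is what the question asks for. As with the tr version, an apostrophe counts as punctuation, so "don't" is split into "don" and "t".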

Curt Sampson 2009-05-27 07:34:47

Answer 4

+6 A:

In Perl:

my %words;
while (<>) {
    # split ' ' splits on runs of whitespace and skips leading whitespace
    $words{lc $_} = 1 for split ' ';
}
print scalar keys %words, "\n";

Christoffer 2009-05-27 07:38:23

Answer 5

+3 A:

Simply (52 strokes):

perl -nE'@w{map lc,split/\W+/}=();END{say 0+keys%w}'

For older perl versions (55 strokes):

perl -lne'@w{map lc,split/\W+/}=();END{print 0+keys%w}'

Hynek -Pichi- Vychodil 2009-05-27 09:19:37

Answer 6

+4 A:

Here is a Perl one-liner:

perl -lne '$h{lc $_}++ for split /[\s.,]+/; END{print scalar keys %h}' file.txt

Or to list the count for each item:

perl -lne '$h{lc $_}++ for split /[\s.,]+/; END{printf "%-12s %d\n", $_, $h{$_} for sort keys %h}' file.txt

This makes an attempt to handle punctuation so that "foo." is counted with "foo" while "don't" is treated as a single word, but you can adjust the regex to suit your needs.
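
A rough Python 3 equivalent of the first one-liner, as a sketch only: it reuses the same separator class (whitespace, "." and ","), so "foo." folds into "foo" and "don't" stays whole. The names SEPARATORS and unique_terms are made up for this example. Unlike the one-liner, it also skips the empty string that a regex split can produce when a line begins with a separator.

```python
import re

# Same separator class as the Perl one-liner: whitespace, '.' and ','.
SEPARATORS = re.compile(r"[\s.,]+")

def unique_terms(lines):
    """Return the set of lowercased terms found in an iterable of lines."""
    terms = set()
    for line in lines:
        for term in SEPARATORS.split(line):
            if term:  # re.split yields '' next to a leading/trailing separator
                terms.add(term.lower())
    return terms

sample = ["Foo. foo, don't", "DON'T bar"]
print(len(unique_terms(sample)))
```

Passing an open file object instead of the sample list works the same way, since iterating over a file yields its lines.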

jmcnamara 2009-05-27 09:55:37

Answer 7

A:

Here is an awk oneliner.

$ gawk -v RS='[[:space:]]' 'NF&&!a[toupper($0)]++{i++}END{print i}' somefile

'NF' means 'if the record is not empty'.
'!a[toupper($0)]++' means 'count each word only the first time it is seen'.

Hirofumi Saito 2009-05-27 10:53:51

Answer 8

+2 A:

A shorter version in Python:

print len(set(w.lower() for w in open('filename.dat').read().split()))

Reads the entire file into memory, splits it into words using whitespace, converts each word to lower case, creates a (unique) set from the lowercase words, counts them and prints the output.

Also possible using a one liner:

python -c "print len(set(w.lower() for w in open('filename.dat').read().split()))"
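
If the file is too large to read into memory at once, a line-by-line variant is possible. The following is a sketch in Python 3 syntax (print is a function there, so the snippets above would need parentheses). The function name count_unique_terms, the path argument, and the UTF-8 default are assumptions for illustration.

```python
def count_unique_terms(path, encoding="utf-8"):
    """Count distinct whitespace-separated terms in a file, ignoring case."""
    seen = set()
    # Iterating over the file object reads one line at a time, so only
    # the set of distinct terms is held in memory, not the whole file.
    with open(path, encoding=encoding) as handle:
        for line in handle:
            # casefold() is a more aggressive lower() meant for caseless
            # matching (for example, German "ß" compares equal to "ss").
            seen.update(term.casefold() for term in line.split())
    return len(seen)
```

Usage would look like count_unique_terms('filename.dat'). Swapping casefold() for lower() reproduces the behaviour of the one-liners above.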

gooli 2009-05-30 17:40:52
