views: 1212

answers: 8

This can be in any high-level language that is likely to be available on a typical Unix-like system (Python, Perl, awk, standard Unix utilities such as sort and uniq, etc.). Hopefully it's fast enough to report the total number of unique terms for a 2 MB text file.

I only need this for quick sanity-checking, so it doesn't need to be well-engineered.

Remember, it must be case-insensitive.

Thank you guys very much.

Side note: If you use Python, please don't use version 3-only code. The system I'm running it on only has 2.4.4.

+4  A: 

In Python 2.4 (it may work on earlier versions as well):

#! /usr/bin/python2.4
import sys
h = set()
for line in sys.stdin.xreadlines():
  for term in line.split():
    h.add(term)
print len(h)

In Perl:

$ perl -ne 'for (split(" ", $_)) { $H{$_} = 1 } END { print scalar(keys%H), "\n" }' <file.txt
pts
line.lower().split()? :)
Skurmedel
For the case insensitivity - you need h.add(term.lower())
viksit
But is that case-insensitive? If I add a "print h" line at the end, for a sample file, I get 4 and set(['bar', 'Foo', 'Bar', 'foo']). Foo and foo should be the same.
Alex
Ah, I'm too slow guys, let me check your comments.
Alex
Seems to work with that correction, thanks!
Alex
Cool, I never even knew about set
Kinlan
If you like one-liners, then the following is equivalent: import sys; print len(set(term.lower() for line in sys.stdin for term in line.split()))
Ants Aasma
The Perl version needs $H{lc($_)} for case insensitivity as well.
mikegrb
+6  A: 

Using bash/UNIX commands:

sed -e 's/[[:space:]]\+/\n/g' $FILE | sort -fu | wc -l
Eduard - Gabriel Munteanu
+4  A: 

Using just standard Unix utilities:

< somefile tr 'A-Z[:blank:][:punct:]' 'a-z\n' | sort | uniq -c

If you're on a system without GNU tr, you'll need to replace "[:blank:][:punct:]" with a list of all the whitespace and punctuation characters you'd like to treat as word separators, rather than as part of a word, e.g., " \t.,;".

If you want the output sorted in descending order of frequency, you can append "| sort -r -n" to the end of this.

Note that this will also produce a count for the empty token that runs of separators turn into; if you're concerned about this, after the tr you can use sed to filter out the empty lines, as in the sketch below.
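
For example, a rough sketch of the adjusted pipeline, using the example separator list " \t.,;" from above and sed to drop the empty lines (this assumes your tr pads the shorter second set the same way the command above relies on; extend the separator list to suit your text):

< somefile tr 'A-Z \t.,;' 'a-z\n' | sed '/^$/d' | sort | uniq -c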

Curt Sampson
+6  A: 

In Perl:

my %words; 
while (<>) { 
    map { $words{lc $_} = 1 } split /\s/;
} 
print scalar keys %words, "\n";
Christoffer
+3  A: 

Simply (52 strokes):

perl -nE'@w{map lc,split/\W+/}=();END{say 0+keys%w}'

For older perl versions (55 strokes):

perl -lne'@w{map lc,split/\W+/}=();END{print 0+keys%w}'
Hynek -Pichi- Vychodil
+4  A: 

Here is a Perl one-liner:

perl -lne '$h{lc $_}++ for split /[\s.,]+/; END{print scalar keys %h}' file.txt

Or to list the count for each item:

perl -lne '$h{lc $_}++ for split /[\s.,]+/; END{printf "%-12s %d\n", $_, $h{$_} for sort keys %h}' file.txt

This makes an attempt to handle punctuation so that "foo." is counted with "foo" while "don't" is treated as a single word, but you can adjust the regex to suit your needs.
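
For example, as an illustrative variation (not from the original answer): to treat every run of non-letter characters except apostrophes as a separator, so that "don't" stays whole while digits and other punctuation are stripped, you could split on /[^a-zA-Z\x27]+/ instead (\x27 is simply the apostrophe, written as a hex escape to avoid clashing with the shell's single quotes):

perl -lne '$h{lc $_}++ for split /[^a-zA-Z\x27]+/; END{print scalar keys %h}' file.txt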

jmcnamara
A: 

Here is an awk one-liner.

$ gawk -v RS='[[:space:]]' 'NF&&!a[toupper($0)]++{i++}END{print i}' somefile
  • 'NF' is true only for non-empty records, so blank tokens are skipped.
  • '!a[toupper($0)]++' is true only the first time a word is seen (uppercased, so matching is case-insensitive); together they count each unique word once.
Hirofumi Saito
+2  A: 

A shorter version in Python:

print len(set(w.lower() for w in open('filename.dat').read().split()))

Reads the entire file into memory, splits it into words on whitespace, converts each word to lower case, builds a set of the unique lowercase words, and prints the size of that set.

Also possible as a shell one-liner:

python -c "print len(set(w.lower() for w in open('filename.dat').read().split()))"
gooli