ansaurus

Question

Bash script to find the frequency of every letter in a file

Answer 1

+2 A:

Here is a suggestion:

while IFS= read -r -n 1 c
do
    echo "$c"
done < "$INPUT_FILE" | grep '[[:alpha:]]' | sort | uniq -c | sort -nr

Benoit 2010-10-19 09:17:01

Thank you for replying.

SkypeMeSM 2010-10-19 09:42:56

Answer 2

+2 A:

Just one awk command:

awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' file

If you want it to be case-insensitive, add tolower():

awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' file

And if you want only letters:

awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' file

If you want only digits, change /[a-zA-Z]/ to /[0-9]/.

If you do not want non-ASCII characters such as ü or é to be counted, run export LC_ALL=C first.

ghostdog74 2010-10-19 09:21:37

Thank you for your reply.

SkypeMeSM 2010-10-19 09:42:40

I am sorry, I am not very familiar with awk. The solution works, but I am getting all characters instead of just alphanumeric characters. This is what I ran: `awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++ sum++ } END{for(i in w) print i,w[i],w[i]/sum}'`

SkypeMeSM 2010-10-19 10:10:17

Thanks again. I am wondering why I get results like ü 2 and é 2, when the regex is [a-zA-Z].

SkypeMeSM 2010-10-19 10:21:26

That's because gawk's regex matching is locale-aware: in a UTF-8 locale, a range such as [a-zA-Z] can also match accented letters.

ghostdog74 2010-10-19 10:27:53

How can I remove them in that case?

SkypeMeSM 2010-10-19 11:12:39

You can run `export LC_ALL=C` before the awk command.

ghostdog74 2010-10-19 12:34:23

Yes, that worked like a charm. Thanks.

SkypeMeSM 2010-10-19 12:44:59
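The locale advice in the comments above can be sketched as follows. This is only an illustration, not the answerer's exact command: the function name ascii_letter_counts is hypothetical, LC_ALL=C is set for the single awk invocation instead of being exported, and substr() replaces the FS="" trick because splitting on an empty field separator is not guaranteed in every awk.

```shell
# Hypothetical helper (not from the thread): count ASCII letters only,
# case-insensitively, with the C locale applied to this one awk call.
ascii_letter_counts() {
    LC_ALL=C awk '{
        n = length($0)
        for (i = 1; i <= n; i++) {
            ch = substr($0, i, 1)          # one byte at a time in the C locale
            if (ch ~ /[a-zA-Z]/) w[tolower(ch)]++
        }
    }
    END { for (ch in w) print ch, w[ch] }' "$1" | sort
}
```

Called as ascii_letter_counts file, it should leave the shell's own locale untouched, which may be preferable to a global export when later commands still need UTF-8 handling.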

Answer 3

+3 A:

A solution with sed, sort and uniq (the \n in the replacement assumes GNU sed):

sed 's/\(.\)/\1\n/g' file | sort | uniq -c

This counts all characters, not only letters. You can filter out the non-letters with:

sed 's/\(.\)/\1\n/g' file | grep '[A-Za-z]' | sort | uniq -c

If you want to treat uppercase and lowercase as the same, just add a translation:

sed 's/\(.\)/\1\n/g' file | tr '[:upper:]' '[:lower:]' | grep '[a-z]' | sort | uniq -c

mouviciel 2010-10-19 09:28:59

Thanks. This treats uppercase and lowercase characters as separate. How can I calculate the frequencies so that A and a count as the same?

SkypeMeSM 2010-10-19 09:42:04

Yes, this works great as well. I am wondering how I can calculate the probabilities, i.e. frequency / total sum. We would need to pipe the output to sed again, but I cannot figure out the regex involved.

SkypeMeSM 2010-10-19 11:22:35

You can add some `wc`, `cut`, `dc`, `tee` and other commands, but it would be more like juggling plates than maintainable work. I think adding more features would be easier with a perl script.

mouviciel 2010-10-19 11:43:23

Thank you very very much for your help. Cheers.

SkypeMeSM 2010-10-19 12:45:33

Answer 4

A:

My solution using grep, sort and uniq.

grep -o . file | sort | uniq -c

Ignore case:

grep -o . file | sort -f | uniq -ic

dogbane 2010-10-19 12:03:10

How can I get frequency / sum(all frequencies) after this?

SkypeMeSM 2010-10-19 12:19:56
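One possible way to get frequency / sum(all frequencies) is to let awk post-process the output of uniq -c, since it can hold the counts in an array until the total is known. This is a minimal sketch rather than anything posted in the thread: the function name letter_freq and the four-decimal output format are assumptions, and grep -o, as in the answer above, is assumed to be available.

```shell
# Hypothetical helper (not from the thread): print each letter, its count,
# and its share of all letters, treating upper and lower case as the same.
letter_freq() {
    grep -o '[[:alpha:]]' "$1" |
        tr '[:upper:]' '[:lower:]' |
        sort | uniq -c |
        awk '{ count[$2] = $1; total += $1 }   # uniq -c prints: count letter
             END {
                 for (c in count)
                     printf "%s %d %.4f\n", c, count[c], count[c] / total
             }' |
        sort
}
```

Run as letter_freq file. The division happens only in the END block, once every count has been read, so no second pass over the file is needed.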
