ansaurus

Question

Linux shell script to count occurrences of a char sequence in a text file?

Answer 1

+6 A:

One option:

echo $((`tr -d "\n" < file | sed 's/thisIsTheSequence/\n/g' | wc -l` - 1))

There are probably more efficient methods using utilities outside the core of shell - particularly if you can fit the file in memory.
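A variant of the same idea, sketched under the assumption that your grep supports the -o option (GNU and BSD grep do, but it is not required by POSIX). It avoids two details that can differ between sed implementations: whether `\n` is accepted in the replacement text and whether a final newline is emitted. The function name `count_seq` is only an illustrative choice.

```shell
#!/bin/sh
# count_seq SEQUENCE FILE
# Strip all newlines, print each literal match on its own line, count the lines.
count_seq() {
  tr -d '\n' < "$2" | grep -o -F -- "$1" | wc -l
}
```

Called as `count_seq thisIsTheSequence file`. Because the newlines are stripped first, grep sees the whole file as one long line, so this still assumes the file fits in memory.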

bdonlan 2009-10-30 22:03:07

Answer 2

A:

Use something like:

head -n LL filename | tail -n YY | grep text | wc -l

where LL is the last line of the sequence and YY is the number of lines in the sequence (i.e. LL - first line + 1). Note that grep | wc -l counts matching lines rather than individual occurrences.
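As a concrete sketch of the head/tail approach, the helper below takes the first and last line numbers and works out the tail count itself. The name `count_lines_in_range` and its argument order are assumptions made for illustration, and it assumes the file has at least LAST lines.

```shell
#!/bin/sh
# count_lines_in_range PATTERN FIRST LAST FILE
# Counts the lines between FIRST and LAST (inclusive) that contain PATTERN.
count_lines_in_range() {
  # head keeps lines 1..LAST, tail keeps the last (LAST - FIRST + 1) of those,
  # grep -c reports how many of the remaining lines match the literal pattern.
  head -n "$3" "$4" | tail -n "$(( $3 - $2 + 1 ))" | grep -c -F -- "$1"
}
```

For example, `count_lines_in_range text 10 20 filename` looks only at lines 10 through 20. A line containing the pattern twice is counted once, and a match broken across two lines is not found.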

Preet Sangha 2009-10-30 22:05:04

Answer 3

A:

Is there ever going to be more than one newline in your sequence?

If not, one solution would be to split your sequence in half and search for the halves (e.g. search for "thisIsTh" and also for "eSequence"), then go back to the occurrences you find and take a "closer look", i.e. strip out the newlines in that area and check for a match.

Basically this is a kind of fast "filtering" of the data to find something interesting.

Artelius 2009-10-30 22:12:16

No, the sequence is 9 characters long. Lines with fewer than 9 characters are irrelevant to the search.

jdc0589 2009-11-02 20:39:32

In that case, you can search for the two halves of the sequence. If it's broken across two lines then you'll find at least ONE of the halves. This is basically a filtering technique that works well (fast) if the halves themselves are fairly rare. But it's a bit of effort to implement.
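A rough awk sketch of the two-halves idea, under these assumptions: the sequence is at least 2 characters long, it contains at most one line break, and it cannot overlap itself. The name `count_split` is hypothetical. The sketch only shows the logic. For a real speed-up the half search would be done by a fast pre-filter such as grep -F, and only the surviving lines would get the closer look.

```shell
#!/bin/sh
# count_split SEQUENCE FILE
count_split() {
  awk -v seq="$1" '
    # Count non-overlapping literal occurrences of seq in s.
    function count(s,   n, p) {
      n = 0
      while ((p = index(s, seq)) > 0) { n++; s = substr(s, p + length(seq)) }
      return n
    }
    BEGIN {
      len  = length(seq)
      half = int(len / 2)
      a = substr(seq, 1, half)     # first half
      b = substr(seq, half + 1)    # second half
    }
    {
      has_a = index($0, a) > 0
      has_b = index($0, b) > 0
      # A match that sits inside one line needs both halves on that line.
      if (has_a && has_b) total += count($0)
      # A match broken at the previous line break leaves the first half on the
      # previous line or the second half on this one, so only then look closer.
      if (NR > 1 && (prev_has_a || has_b))
        total += count(prev_tail substr($0, 1, len - 1))
      prev_has_a = has_a
      # Remember the last len-1 characters for the next boundary check.
      prev_tail = (length($0) >= len) ? substr($0, length($0) - len + 2) : $0
    }
    END { print total + 0 }
  ' "$2"
}
```

Called as `count_split thisIsTheSequence file`. The boundary check joins at most len-1 characters from each side, so anything it finds must straddle the line break and is not counted twice.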

Artelius 2009-11-02 22:15:58

Answer 4

A:

Just one awk script will do, since you will be processing a huge file. Chaining multiple pipes can slow things down.

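#!/bin/bash
# Join the lines into one buffer and count matches, including those split across lines.
awk 'BEGIN{
 search="thisIsTheSequence"   # gsub() treats this as a regular expression
 total=0
}
# Every 10th line, count and remove the matches buffered so far.
NR%10==0{
  c=gsub(search,"",s)
  total+=c
}
# Append every line to the buffer; the newline is dropped.
{ s=s $0 }
END{
 # Count whatever is still in the buffer.
 c=gsub(search,"",s)
 print "total count: "total+c
}' file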

output

$ more file
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasdaasdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda

$ ./shell.sh
total count: 9
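A possible variation for very large files, offered as a sketch rather than as part of the script above: after counting, keep only the last length-1 characters of the buffer, since that is all a later match could still need. This keeps memory use bounded and treats the sequence as a literal string via index() instead of a regular expression. The name `count_stream` is made up for the example.

```shell
#!/bin/sh
# count_stream SEQUENCE FILE
count_stream() {
  awk -v seq="$1" '
    BEGIN { keep = length(seq) - 1 }
    {
      buf = buf $0                      # append the line without its newline
      # Count and consume every complete match currently in the buffer.
      while ((p = index(buf, seq)) > 0) {
        total++
        buf = substr(buf, p + length(seq))
      }
      # Only the last keep characters can belong to a match that is still open.
      if (length(buf) > keep) buf = substr(buf, length(buf) - keep + 1)
    }
    END { print total + 0 }
  ' "$2"
}
```

Running `count_stream thisIsTheSequence file` on the sample file above should also report 9, printed as a bare number.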

ghostdog74 2009-10-31 13:50:39
