I'm making a shell script to find bigrams, which works, sort of.

#tokenise words
tr -sc 'a-zA-Z0-9.' '\012' < "$1" > out1
#create 2nd list offset by 1 word
tail -n+2 out1 > out2
#paste list together
paste out1 out2 
#clean up
rm out1 out2

The only problem is that it pairs the last word of one sentence with the first word of the next.

E.g. for the two sentences 'hello world.' and 'foo bar.' I'll get a line with 'world. foo'. Would it be possible to filter these out with grep or something?

I know I can find all bigrams containing a full stop with grep [.], but that also matches the legitimate bigrams.

+1  A: 

Just replace the paste line with this:

paste out1 out2 | grep -v '\..'

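This filters out any line in which a period is followed by another character, i.e. any line where a period is not the last character. Because paste joins the two words with a tab, a sentence-ending first word always has its period followed by that tab.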
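As a rough illustration of the filter, here is a small sketch that feeds it the paste output you would expect for the two example sentences. The function names sample_pairs and filter_pairs are hypothetical, and the sample lines are simulated with printf rather than produced from real files.

```shell
# Simulated `paste out1 out2` output for 'hello world.' and 'foo bar.'.
# Fields are tab-separated; the last line has an empty second field
# because out2 is one line shorter than out1.
sample_pairs() {
    printf 'hello\tworld.\nworld.\tfoo\nfoo\tbar.\nbar.\t\n'
}

# Drop every line in which a period is followed by any character.
filter_pairs() {
    grep -v '\..'
}

sample_pairs | filter_pairs
```

The 'world.<TAB>foo' line and the trailing 'bar.<TAB>' line are dropped, because in both the period is followed by a tab.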

Robert Gamble
The grep expression matches (and the `-v` option excludes) anything that matches a dot followed by something that is not a dollar sign. Since the output of `tr` does not include dollars, it works, but it is not obvious that the character class was necessary. It could have been just '\..'.
Jonathan Leffler
Doh, thanks for pointing that out, fixed.
Robert Gamble
A: 

thank you very much!

please just upvote and avoid such noisy answers
bene
+2  A: 

Shell scripts can use pipes.

cat "$@" |
tr -cs "a-zA-Z0-9." '\012' |
{
old="aaa."
while read new
do
    case "$old" in
    *.) : OK;;
    *)  echo "$old $new";;
    esac
    old="$new"
done
}

The code uses cat as the universal collector of data - tr is a pure filter that does not accept any filename arguments. The basic idea is that the variable old holds the previous word, and read stores the next word in new. When old ends with a period (as it does at the start, because of the "aaa." initial value), the pair would span a sentence boundary, so it does not form a valid bigram under your rules and is skipped. If you want to remove the dots from the sentence-ending bigrams, you can use:

 echo "$old ${new%.}"

The unadorned version (with dots echoed) works with Bourne shell as well as derivatives; the version with the ${new%.} only works with Korn shell and derivatives (including POSIX shells such as bash and dash) - not the original Bourne shell.
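For reference, a minimal sketch of the dot-stripping variant wrapped in a function that reads from standard input. The function name bigrams is hypothetical, and two details are my additions rather than part of the loop above: read -r (to keep backslashes literal) and the extra '' pattern, which skips the empty first line that tr emits when the input starts with a non-word character.

```shell
# Hypothetical wrapper: reads text on stdin, prints one bigram per line,
# never pairing the end of one sentence with the start of the next.
bigrams() {
    tr -cs 'a-zA-Z0-9.' '\012' |
    {
        old="aaa."      # sentinel ending in a dot, so the first word is never paired
        while read -r new
        do
            case "$old" in
            ''|*.) : ;;                     # empty or sentence-ending word: no bigram
            *)     echo "$old ${new%.}" ;;  # strip one trailing dot from the second word
            esac
            old="$new"
        done
    }
}

printf 'hello world. foo bar.\n' | bigrams
```

For the sample input this prints 'hello world' and 'foo bar' on separate lines.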

If you must use temporary files, then make their names contain the process ID ($$) and use trap to remove them:

tmp=${TMPDIR:-/tmp}/bigram.$$
trap 'rm -f $tmp.?; exit 1' 0 1 2 3 13 15

...code using $tmp.1, $tmp.2, etc...

rm -f $tmp.?
trap 0

Signal 1 is hangup (HUP), 2 is interrupt (INT), 3 is quit (QUIT), 13 is pipe (PIPE) and 15 is terminate (TERM); 0 is 'any exit' and is almost juju in this context. Before actually exiting, remember to cancel the exit trap, as shown.
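Putting the pieces together, the temporary-file approach might look roughly like the sketch below. It reuses the tokenise, tail, paste and grep steps from elsewhere in this thread; the function name bigrams_tmp is hypothetical, and the body runs in a subshell (parentheses instead of braces) on the assumption that you want the traps confined to the function.

```shell
# Hypothetical function: reads text on stdin and prints tab-separated bigrams,
# using temporary files that are removed on exit or on a signal.
bigrams_tmp() (
    tmp=${TMPDIR:-/tmp}/bigram.$$
    trap 'rm -f "$tmp".?; exit 1' 0 1 2 3 13 15

    tr -sc 'a-zA-Z0-9.' '\012' > "$tmp.1"       # tokenise words
    tail -n +2 "$tmp.1" > "$tmp.2"              # second list, offset by one word
    paste "$tmp.1" "$tmp.2" | grep -v '\..'     # pair up, drop cross-sentence pairs

    rm -f "$tmp".?                              # normal cleanup
    trap 0                                      # cancel the exit trap before leaving
)

printf 'hello world. foo bar.\n' | bigrams_tmp
```

Because the exit trap is cancelled on the normal path, the 'exit 1' in the trap only takes effect when the script is interrupted or fails partway through.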

Jonathan Leffler
+1  A: 

You may also want to browse Ken Church's "Unix for Poets" (PDF) - a classic describing solutions to similar problems.

Yuval F