Good day shell lovers!

Basically I have two files:

frequency.txt (multiple lines; each line is a word followed by its frequency, space-separated):

de 1711
a 936
et 762
la 530
les 482
pour 439
le 425
...

and I have a file containing "prohibited" words:

stopwords.txt (one single line of space-separated words):

 au aux avec le ces dans ...

So I want to delete from frequency.txt every line whose word appears in stopwords.txt.

How could I do that? I'm thinking it could be done with awk, something like:

awk 'match($0,SOMETHING_MAGICAL_HERE) == 0 {print $0}' frequency.txt > new.txt

but I'm not really sure... any ideas? Thanks in advance.

+4  A: 

This will do it for you:

tr ' ' '\n' <stopwords.txt | grep -v -w -F -f - frequency.txt

Since stopwords.txt is a single space-delimited line, tr first replaces the spaces with newlines so that each stop word ends up on its own line, which is what -f expects. The grep options:

-v is to invert the match
-w is to match whole words only (so le doesn't also remove les, less, etc.)
-F is to indicate that the patterns are fixed strings, not regular expressions
-f - is to read the pattern strings from standard input, i.e. from the tr output
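
With the sample data from the question (assuming stopwords.txt contains only the words shown), le is the only stop word that also appears among the frequency lines shown, so only the le 425 line should be dropped, and les 482 survives thanks to -w:

$ tr ' ' '\n' <stopwords.txt | grep -v -w -F -f - frequency.txt
de 1711
a 936
et 762
la 530
les 482
pour 439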

Michael Goldshteyn
And throw a `-F` in there too to make it a fraction faster (and avoid problems if any "words" contain `.` or other unusual characters).
j_random_hacker
The stop words are in one line and -f expects them to be on separate lines.
codaddict
That will not work since the stop words are all on one line, separated by spaces.
ghostdog74
Hmmm... Also, e.g. `le` appearing in stopwords.txt will remove any line *containing* `le` (e.g. `less`, `little`).
j_random_hacker
Thanks.. it works.. even though the man page says the words in stopwords.txt should be on separate lines. Ahh wait.. maybe @j_random_hacker is right.. I'll verify that
pleasedontbelong
You can always also convert a space-delimited file to a newline-delimited one using tr: tr ' ' '\n' < infile > outfile
Michael Goldshteyn
My bad.. it's not working... maybe it's because of the format of stopwords.txt. @Michael, how could I use `tr` and pipe it into grep?
pleasedontbelong
Try the updated version
Michael Goldshteyn
@Michael: You still haven't addressed the fact that `le` will get rid of `less`, `little`, `inexcusable` etc. :-P
j_random_hacker
@Michael: You will need the `-w` (or `-o`?) option as well so you match exact words.
ghostdog74
Updated with -w, thanks j_random_hacker and ghostdog74 for pointing this out
Michael Goldshteyn
All good now, +1 :)
j_random_hacker
+5  A: 
$ awk 'FNR==NR{for(i=1;i<=NF;i++)w[$i];next}(!($1 in w))' stop.txt freq.txt
de 1711
a 936
et 762
la 530
les 482
pour 439
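
Spelled out with comments, that one-liner is roughly the following (just an expanded sketch of the same logic, not a different method):

awk '
    # First file (stop.txt): while reading it, FNR==NR is true.
    FNR == NR {
        for (i = 1; i <= NF; i++)
            w[$i]        # remember every stop word as an array key
        next             # skip the printing rule for this file
    }
    # Second file (freq.txt): print only lines whose first field is not a stop word.
    !($1 in w)
' stop.txt freq.txt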
ghostdog74
Hey thanks... seems to be working
pleasedontbelong
Nice. You clearly are *the* AWK master.
Dan Moulding
@Dan: Rightly said :) +1 to you and +1 to the master.
codaddict
+2  A: 
tr ' ' '\n' < stopwords.txt | grep -vwFf - frequency.txt

The -w to grep is crucial: without it, a stop word like le would also remove lines for words that merely contain le, such as less or little.
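
To make the difference concrete (a small sketch using the question's frequency.txt, with le as the only pattern):

grep -v -F le frequency.txt      # drops both "le 425" and "les 482"
grep -v -w -F le frequency.txt   # drops only "le 425", keeps "les 482"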

j_random_hacker
Yeap... you're right.. I did need the `-w`. The awk from @ghostdog74 is working too, but using grep looks cleaner
pleasedontbelong
A: 
join -v1 <(sort frequency.txt) <(tr ' ' '\n' <stopwords.txt|sort) | sort -k2,2rn
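
For readers unfamiliar with join, here is one reading of what each stage does (the command itself is unchanged; note that <(...) process substitution needs bash or a similar shell):

# join -v1 prints lines of the first input that have no match in the second,
# i.e. frequency lines whose word is not a stop word.
# Both inputs must be sorted on the join field (the word), hence the sort calls.
# tr turns the single-line stopwords.txt into one word per line.
# The final sort -k2,2rn re-sorts the result by frequency (field 2), descending.
join -v1 <(sort frequency.txt) <(tr ' ' '\n' <stopwords.txt | sort) | sort -k2,2rn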
pixelbeat