Good day shell lovers!

Basically I have two files:

frequency.txt (multiple lines; each line is a word followed by its frequency, space-separated):

de 1711
a 936
et 762
la 530
les 482
pour 439
le 425
...

and I have a file containing "prohibited" words:

stopwords.txt (one single line of space-separated words):

 au aux avec le ces dans ...

So I want to delete from frequency.txt every line whose word appears in stopwords.txt.

How could I do that? I'm thinking it could be done with awk, something like:

awk 'match($0,SOMETHING_MAGICAL_HERE) == 0 {print $0}' frequency.txt > new.txt

but I'm not really sure... any ideas? Thanks in advance.

+4  A: 

This will do it for you:

tr ' ' '\n' <stopwords.txt | grep -v -w -F -f - frequency.txt

Since stopwords.txt is a single space-delimited line, tr first replaces the spaces with newlines so that each stop word ends up on its own line, which is what -f expects. The grep options:

-v is to invert the match
-w is to match whole words only (so le doesn't also remove les, less, etc.)
-F is to indicate that the patterns are fixed strings, not regular expressions
-f - is to read the pattern strings from standard input, i.e. from the tr output
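
With the sample data from the question (assuming stopwords.txt contains only the words shown), le is the only stop word that also appears among the frequency lines shown, so only the le 425 line should be dropped, and les 482 survives thanks to -w:

$ tr ' ' '\n' <stopwords.txt | grep -v -w -F -f - frequency.txt
de 1711
a 936
et 762
la 530
les 482
pour 439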

Michael Goldshteyn
And throw a `-F` in there too to make it a fraction faster (and avoid problems if any "words" contain `.` or other unusual characters).
j_random_hacker
The stop words are in one line and -f expects them to be on separate lines.
codaddict
That will not work since the stop words are all on one line, separated by spaces.
ghostdog74
Hmmm... Also, e.g. `le` appearing in stopwords.txt will remove any line *containing* `le` (e.g. `less`, `little`).
j_random_hacker
Thanks.. it works.. even though the man page says the words in stopwords.txt should be on separate lines. Ahh wait.. maybe @j_random_hacker is right.. I'll verify that
pleasedontbelong
You can always also convert a space-delimited file to a newline-delimited one using tr: tr ' ' '\n' < infile > outfile
Michael Goldshteyn
My bad.. it's not working... maybe it's because of the format of stopwords.txt. @Michael, how could I use `tr` and pipe it into grep?
pleasedontbelong
Try the updated version
Michael Goldshteyn
@Michael: You still haven't addressed the fact that `le` will get rid of `less`, `little`, `inexcusable` etc. :-P
j_random_hacker
@Michael: You will need the `-w` (or `-o`?) option as well so you match exact words.
ghostdog74
Updated with -w, thanks j_random_hacker and ghostdog74 for pointing this out
Michael Goldshteyn
All good now, +1 :)
j_random_hacker
+5  A: 
$ awk 'FNR==NR{for(i=1;i<=NF;i++)w[$i];next}(!($1 in w))' stop.txt freq.txt
de 1711
a 936
et 762
la 530
les 482
pour 439
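
Spelled out with comments, that one-liner is roughly the following (just an expanded sketch of the same logic, not a different method):

awk '
    # First file (stop.txt): while reading it, FNR==NR is true.
    FNR == NR {
        for (i = 1; i <= NF; i++)
            w[$i]        # remember every stop word as an array key
        next             # skip the printing rule for this file
    }
    # Second file (freq.txt): print only lines whose first field is not a stop word.
    !($1 in w)
' stop.txt freq.txt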
ghostdog74
Hey thanks... seems to be working
pleasedontbelong
Nice. You clearly are *the* AWK master.
Dan Moulding
@Dan: Rightly said :) +1 to you and +1 to the master.
codaddict
+2  A: 
tr ' ' '\n' < stopwords.txt | grep -vwFf - frequency.txt

The -w to grep is crucial: without it, a stop word like le would also remove lines for words that merely contain le, such as less or little.
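
To make the difference concrete (a small sketch using the question's frequency.txt, with le as the only pattern):

grep -v -F le frequency.txt      # drops both "le 425" and "les 482"
grep -v -w -F le frequency.txt   # drops only "le 425", keeps "les 482"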

j_random_hacker
Yeap... you're right.. I did need the `-w`. The awk from @ghostdog74 is working too, but using grep looks cleaner
pleasedontbelong
A: 
join -v1 <(sort frequency.txt) <(tr ' ' '\n' <stopwords.txt|sort) | sort -k2,2rn
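
For readers unfamiliar with join, here is one reading of what each stage does (the command itself is unchanged; note that <(...) process substitution needs bash or a similar shell):

# join -v1 prints lines of the first input that have no match in the second,
# i.e. frequency lines whose word is not a stop word.
# Both inputs must be sorted on the join field (the word), hence the sort calls.
# tr turns the single-line stopwords.txt into one word per line.
# The final sort -k2,2rn re-sorts the result by frequency (field 2), descending.
join -v1 <(sort frequency.txt) <(tr ' ' '\n' <stopwords.txt | sort) | sort -k2,2rn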
pixelbeat