I'm trying to create a dictionary of words from a collection of files. Is there a simple way to print all the words in a file, one per line?
+3
A:
A good start is to simply use sed to replace all spaces with newlines, strip out the empty lines (again with sed), then sort with the -u (uniquify) flag to remove duplicates, as in this example:
$ echo "the quick brown dog and fox jumped
over the lazy dog" | sed 's/ /\n/g' | sed '/^$/d' | sort -u
and
brown
dog
fox
jumped
lazy
over
quick
the
Then you can start worrying about punctuation and the like.
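As a rough sketch of that next step (not part of the pipeline above), tr can split on anything that is not a letter, which handles spaces and punctuation in one pass. The function name words_unique is made up for illustration, and the sketch assumes it is acceptable for contractions such as "don't" to be split into "don" and "t".

```shell
# words_unique: hypothetical helper, reads text on stdin and prints
# each distinct word on its own line.
words_unique() {
    # -c complements the set (every character that is NOT a letter),
    # -s squeezes each run of such characters into a single newline.
    # sed then drops the empty line left behind when the input starts
    # with a non-letter, and sort -u removes duplicates.
    tr -cs '[:alpha:]' '\n' | sed '/^$/d' | sort -u
}

# Example call; the sort order of mixed-case words depends on the locale.
echo 'the quick, brown dog; and the "fox"!' | words_unique
```

To run it on a file instead, redirect the file into the function: words_unique < file.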
paxdiablo
2009-07-14 05:31:45
A:
Assuming words are separated by whitespace:
awk '{for(i=1;i<=NF;i++)print $i}' file
or
tr ' ' "\n" < file
If you want uniqueness:
awk '{for(i=1;i<=NF;i++)_[$i]++}END{for(i in _) print i}' file
tr ' ' "\n" < file | sort -u
With some punctuation removed:
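awk '{
    gsub(/["*^&()#@$,?~]/, "")             # delete the listed punctuation characters
    for (i = 1; i <= NF; i++) { _[$i] }   # referencing _[$i] records the word as an array key
}
END { for (o in _) { print o } }' file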
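A possible variation, sketched here under a few assumptions rather than taken from the answer: instead of listing punctuation characters one by one, replace every run of non-alphanumeric characters with a space, and fold case so that "The" and "the" count as one word. The wrapper name unique_words_awk is hypothetical. Note that this splits "don't" into "don" and "t", and the ASCII ranges ignore accented letters.

```shell
# unique_words_awk FILE...: hypothetical wrapper; reads stdin when
# no files are given.
unique_words_awk() {
    awk '{
        # Turn each run of non-alphanumeric characters into one space;
        # modifying $0 makes awk re-split the fields.
        gsub(/[^A-Za-z0-9]+/, " ")
        for (i = 1; i <= NF; i++)
            seen[tolower($i)]      # referencing a key creates it
    }
    END { for (w in seen) print w }' "$@" | sort
}

# Example call on inline text.
printf 'The dog, the fox!\n"Dog" (fox)\n' | unique_words_awk
```

Because awk accepts several file names, unique_words_awk *.txt builds one word list from a whole collection of files.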
ghostdog74
2009-07-14 05:32:01
A:
Yes, you can write a program to do this. Start by learning how to program; Python is a good starting language.
asperous.us
2009-07-14 05:32:09
@asperous.us, while Python is a wonderful language, I would suggest that a "simple way" wouldn't involve learning a brand new language, especially as the questioner specifically stated UNIX shell scripting in the tags.
paxdiablo
2009-07-14 05:39:34
@pax, actually, if you look closer at the tags, there isn't one that says shell-scripting. It just says "shell" and "scripting", which could mean using the shell to do some scripting, while scripting can also be Perl scripting, Python scripting, etc. Not to nitpick, but I think what @asperous suggests is still all right, considering that these programming languages are available in most distros nowadays. As for "involving the learning of a new language", I partly agree with you; on the other hand, I also partly disagree, as these programming languages can sometimes make up for what shell scripting lacks.
ghostdog74
2009-07-14 05:53:06
To each their own, @ghostdog74. I read it differently, although I don't disagree with the answer enough to downvote it (I'm against downvoting "competitors" unless they're heinously wrong).
paxdiablo
2009-07-14 06:08:02
Python would be fine if it were used to solve the problem.
drewster
2009-07-14 07:45:10
+1
A:
You could use grep:
-E '\w+' searches for words
-o only prints the portion of the line that matches
% cat temp
Some examples use "The quick brown fox jumped over the lazy dog,"
rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
for example text.
# if you don't care whether words repeat
% grep -o -E '\w+' temp
Some
examples
use
The
quick
brown
fox
jumped
over
the
lazy
dog
rather
than
Lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
for
example
text
If you want to only print each word once, disregarding case, you can use sort:
-u only prints each word once
-f tells sort to ignore case when comparing words
# if you only want each word once
% grep -o -E '\w+' temp | sort -u -f
adipiscing
amet
brown
consectetur
dog
dolor
elit
example
examples
for
fox
ipsum
jumped
lazy
Lorem
over
quick
rather
sit
Some
text
than
The
use
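Since the question mentions a collection of files, here is a minimal sketch of the same pipeline applied to several files at once. The function name dict_from_files is invented for illustration. It adds -h, which GNU and BSD grep use to suppress the "filename:" prefix that grep otherwise prints when given more than one file. Keep in mind that \w also matches digits and underscores.

```shell
# dict_from_files FILE...: hypothetical helper that prints every
# distinct word (ignoring case) found in the given files.
dict_from_files() {
    # -o prints only the matching part, -h drops the file name prefix,
    # -E enables extended regular expressions.
    grep -ohE '\w+' "$@" | sort -u -f
}

# Example call on two throwaway files.
tmpdir=$(mktemp -d)
printf 'alpha beta\n' > "$tmpdir/a.txt"
printf 'beta, gamma!\n' > "$tmpdir/b.txt"
dict_from_files "$tmpdir/a.txt" "$tmpdir/b.txt"
rm -rf "$tmpdir"
```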
rampion
2009-07-14 06:21:32
A:
Ken Church's "Unix(TM) for Poets" (PDF) describes exactly this type of application - extracting words out of text files, sorting and counting them, etc.
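For a feel of the "sorting and counting" part, here is a small sketch in that spirit. It is an illustration built from standard tools, not a quotation from the paper, and the function name word_freq is made up.

```shell
# word_freq FILE...: hypothetical helper that prints "count word"
# lines, most frequent word first. Reads stdin when no files are given.
word_freq() {
    # 1. tr -cs: turn every run of non-letters into a single newline
    # 2. tr:     fold upper case to lower case
    # 3. sed:    drop any empty line
    # 4. sort | uniq -c: group identical words and count them
    # 5. sort -rn: order by count, highest first
    cat "$@" |
        tr -cs '[:alpha:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        sed '/^$/d' |
        sort | uniq -c | sort -rn
}

# Example call on inline text.
printf 'the cat and the dog and the bird\n' | word_freq
```

Dropping the final sort -rn and the counts (sort -u instead of sort | uniq -c) gives back a plain word list.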
Yuval F
2009-07-14 07:15:39