Let's say "textfile" contains the following:

lorem$ipsum-is9simply the.dummy text%of-printing

and that you want to print each word on a separate line. However, words should be delimited not only by spaces but by any non-alphanumeric character. So the result should look like:

 lorem
 ipsum  
 is9simply  
 the  
 dummy  
 text  
 of  
 printing

How can I accomplish this using the Bash shell?



Some notes:

+3  A: 

Use the tr command:

tr -cs 'a-zA-Z0-9' '\n' <textfile

The '-c' is for the complement of the specified characters; the '-s' squeezes out duplicates of the replacements; the 'a-zA-Z0-9' is the set of alphanumerics; the '\n' is the replacement character (newline). You could also use a character class, which is locale-sensitive (and may include more characters than the list above):

tr -cs '[:alnum:]' '\n' <textfile
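
As a quick check of the command above, here is a minimal sketch that feeds the sample line on stdin instead of reading "textfile". The helper name split_words is hypothetical, introduced only for illustration. The extra grep is an assumption about what you want: when the input starts with a separator, tr emits one empty first line, and grep . drops it.

```shell
# split_words: hypothetical helper wrapping the tr command from this answer.
# tr turns every run of non-alphanumeric characters into a single newline;
# grep . then discards any empty line (e.g. from a leading separator).
split_words() {
  tr -cs 'a-zA-Z0-9' '\n' | grep .
}

# Sample line from the question; single quotes keep $ and % literal.
printf '%s\n' 'lorem$ipsum-is9simply the.dummy text%of-printing' | split_words
```

Note that "is9simply" stays in one piece because 9 is alphanumeric.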
Jonathan Leffler
Perfect, this is exactly what I was after. Thanks! (I'm sorry I don't have enough reputation to vote up your answer.)
Sv1
A: 
$ awk -f splitter.awk < textfile

$ cat splitter.awk
{
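  # "+" treats a run of separators as one, so "a, b" does not yield an empty word
  count0 = split($0, asplit, "[^a-zA-Z0-9]+")
  # skip the empty element produced when a line starts or ends with a separator
  for(i = 1; i <= count0; ++i) { if (asplit[i] != "") print asplit[i] }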
}
DigitalRoss
Thanks Ross! This is pretty cool. I've been meaning to get into the awk universe :)
Sv1